Introduction
Pandas is the most popular Python library for working with structured data, largely because of its core data structure, the DataFrame. A DataFrame is a table in which each named column (sometimes called a field) holds one kind of data and each row represents a record or entity.
Polars is a lesser-known alternative to Pandas that can be almost 3 times faster. Pandas is still one of the best tools out there for data manipulation and analysis, and Polars is in no way a replacement for it, at least for the time being. I just want to share this library so that you know about an alternative you can try out.
Working with Polars
Polars can be installed from PyPI with pip:
pip install polars
Importing libraries
Polars offers much of the same functionality as Pandas, so switching over should not be a problem for anyone.
import polars as pl
import matplotlib.pyplot as plt
%matplotlib inline
Loading Dataset
data = pl.read_csv("../sample.csv")
print(type(data))
> <class 'polars.frame.DataFrame'>
Let us start with some basic data analysis.
Getting familiar with the dataset
data.shape
> (150930, 11)
data.columns
data.dtypes
data.head()
As you can see, this is a large dataset: with 11 columns and over 150k entries, we have a lot of data to analyze. The columns I am interested in are country, points, and price. Let us see what we can find.
Null Values
Before moving forward, we have to take care of any null values. We can find them easily using null_count().
data.null_count()
So around 13.5k entries are missing a value in the price column. We could drop these rows, since they make up less than 10% of the whole dataset, but we can instead fill them with another value such as the mean:
# Recent Polars releases spell this fill_null; older releases used fill_none.
data = data.with_columns(pl.col('price').fill_null(strategy='mean'))
Performing Analysis
Now we dig a little deeper and look into some statistical analysis. This can help us gain some insightful knowledge of the dataset.
Our goal is to compare how price and points vary from country to country.
# Analyses of wine prices
print(f'Median price: {data["price"].median()}')
print(f'Average price: {data["price"].mean()}')
print(f'Maximum price: {data["price"].max()}')
print(f'Minimum price: {data["price"].min()}')
# Analyses of wine points
print(f'Median points: {data["points"].median()}')
print(f'Average points: {data["points"].mean()}')
print(f'Maximum points: {data["points"].max()}')
print(f'Minimum points: {data["points"].min()}')
Thus we can see that a wine can cost as little as 4 dollars. Now let's see which countries the wines come from.
countries = data['country'].unique().to_list()
print(f'There are {len(countries)} countries in the list')
> There are 49 countries in the list
Scrolling through the dataset, we can see that there are 2 strange values in the country column: an undefined country ('') and a country called 'US-France':
# filter() keeps the rows for which the expression is true
print(data.filter((pl.col('country') == '') | (pl.col('country') == 'US-France')))
Since only 6 entries have these odd values, I think it is safe to drop those rows.
data = data.filter((pl.col('country') != '') & (pl.col('country') != 'US-France'))
Now we look into which countries have the best and the costliest wines.
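# group_by(...).agg(...) is the current spelling; older releases used groupby(...).select(...)
# Countries ranked by average points
print(data.group_by('country').agg(pl.col('points').mean().alias('points_mean')).sort('points_mean', descending=True))
# Countries ranked by their most expensive wine
print(data.group_by('country').agg(pl.col('price').max().alias('price_max')).sort('price_max', descending=True))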
Thus we can see that England ranks among the best on average points, but the costliest wine is from France.
Conclusion
If you are interested in a more in-depth look at the workings of the library, I highly recommend reading this article by the creator of Polars himself.