Introduction
In this article, we will see how aggregation works in pandas. The pandas library provides many aggregation functions that are simple to understand and apply, and they cover most common mathematical calculations. It is difficult to cover every function in one article, and several of them are very similar or straightforward, so this article covers some of the important ones. Let’s have a look.
Setup
The setup is very similar to the one in my other pandas articles on C# Corner. We will work on a Kaggle dataset that provides YouTube trending video statistics (URL: https://www.kaggle.com/datasnaek/youtube-new). The file used in this article is ‘USvideos.csv’.
import numpy as np
import pandas as pd
df = pd.read_csv('USvideos.csv')
df.columns
The ‘df.columns’ attribute lists the columns of the data set.
Let’s understand by example. First, we will sort the DataFrame in descending order of the number of ‘likes’.
likesdf = df.sort_values(by='likes', ascending=False)
likesdf.head()
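To see the sort in isolation, here is a minimal sketch on a small made-up DataFrame. The titles and like counts are hypothetical and are not taken from ‘USvideos.csv’.

```python
import pandas as pd

# Hypothetical toy data, not taken from USvideos.csv
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "likes": [10, 30, 20],
})

# ascending=False puts the row with the most likes first
by_likes = toy.sort_values(by="likes", ascending=False)
top_titles = by_likes["title"].tolist()
print(top_titles)  # ['b', 'c', 'a']
```

Note that sort_values returns a new DataFrame and keeps the original row labels, so the index of the result is no longer in ascending order.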
The ‘likesdf’ DataFrame has many columns, such as ‘publish_time’, ‘comments’, etc. Let’s select only the numeric columns so that it is easier to apply aggregation functions.
newlikesdf = likesdf.select_dtypes(include=np.number)
newlikesdf.head()
The ‘newlikesdf’ DataFrame now contains only the numeric columns, such as ‘likes’, ‘dislikes’, ‘comment_count’, and ‘views’.
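The following sketch shows which kinds of columns survive the numeric filter. It uses a made-up DataFrame whose column names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with text, integer, float, and boolean columns
toy = pd.DataFrame({
    "title": ["a", "b"],
    "likes": [10, 30],
    "ratio": [0.5, 0.25],
    "comments_disabled": [False, True],
})

# np.number matches integer and float columns; text and boolean columns are dropped
numeric = toy.select_dtypes(include=np.number)
numeric_cols = list(numeric.columns)
print(numeric_cols)  # ['likes', 'ratio']
```

Boolean columns are not treated as numeric here, so flags stored as True/False will not appear in the filtered DataFrame.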
sum
The ‘sum’ function calculates the sum of each column.
newlikesdf.sum()
Since sum() is applied to the entire DataFrame, the sum is calculated for every column. The sum function can be applied to an individual column as well.
newlikesdf['likes'].sum()
# 3041147198
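As a small illustration of the difference between summing a whole DataFrame and summing one column, here is a sketch on made-up numbers. It also shows the axis parameter, which the article has not used so far, to sum across each row instead.

```python
import pandas as pd

# Hypothetical toy data, not taken from USvideos.csv
toy = pd.DataFrame({"likes": [10, 30, 20], "dislikes": [1, 3, 2]})

column_totals = toy.sum()         # Series with one total per column
likes_total = toy["likes"].sum()  # single number for one column
row_totals = toy.sum(axis=1)      # Series with one total per row

print(likes_total)  # 60
```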
max
The ‘max’ function computes the maximum value of every column.
newlikesdf.max()
Just like max, another function ‘min’ is available to compute the minimum value.
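Here is a minimal sketch of ‘max’ and ‘min’ side by side on made-up data. It also uses idxmax, which is not covered above, to find which row holds the maximum; the row labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with made-up row labels
toy = pd.DataFrame(
    {"likes": [10, 30, 20], "dislikes": [1, 3, 2]},
    index=["a", "b", "c"],
)

highest = toy.max()  # largest value in each column
lowest = toy.min()   # smallest value in each column

# idxmax returns the row label of the maximum rather than the value itself
most_liked = toy["likes"].idxmax()
print(most_liked)  # b
```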
mean
The ‘mean’ function computes the mean value of each column. Mathematically speaking, the mean is the arithmetic average of a set of numbers. For example, the mean of the 3 numbers 1, 2, 3 is (1 + 2 + 3) / 3 = 2.
newlikesdf.mean()
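The arithmetic above can be checked on a small Series. The sketch also shows how missing values are handled, assuming the default skipna=True behaviour; the values are made up.

```python
import numpy as np
import pandas as pd

simple = pd.Series([1, 2, 3])
simple_mean = simple.mean()  # (1 + 2 + 3) / 3

# Missing values are skipped by default, so this is (1 + 3) / 2
with_gap = pd.Series([1.0, np.nan, 3.0])
gap_mean = with_gap.mean()

# With skipna=False a single missing value makes the result NaN
strict_mean = with_gap.mean(skipna=False)

print(simple_mean, gap_mean)  # 2.0 2.0
```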
agg
The ‘agg’ function accepts a list of functions to apply. For example, we can pass any of sum, min, max, and mean to ‘agg’ to get all the results in one call rather than computing them individually.
newlikesdf.agg(['sum', 'min'])
Let's add the 'max' function to the list.
newlikesdf.agg(['sum', 'min', 'max'])
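Besides a list, ‘agg’ can also take a dictionary that maps column names to functions, which is handy when different columns need different aggregations. The sketch below shows both forms on made-up data.

```python
import pandas as pd

# Hypothetical toy data, not taken from USvideos.csv
toy = pd.DataFrame({"likes": [10, 30, 20], "views": [100, 300, 200]})

# List form: every function is applied to every column.
# The result is a DataFrame with one row per function.
summary = toy.agg(["sum", "min", "max"])

# Dictionary form: one function per column.
# The result is a Series indexed by column name.
picked = toy.agg({"likes": "sum", "views": "max"})

print(picked["likes"], picked["views"])  # 60 300
```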
std
The ‘std’ function computes the standard deviation of each column. The standard deviation measures how widely the values are spread around the mean of the data sample.
newlikesdf.std()
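One detail worth knowing is that pandas computes the sample standard deviation by default (it divides by n - 1). The sketch below, on made-up numbers, compares it with the population standard deviation obtained through the ddof parameter.

```python
import pandas as pd

# Made-up values with mean 5; the squared deviations add up to 32
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = values.std()            # divides by n - 1 = 7 (default, ddof=1)
population_std = values.std(ddof=0)  # divides by n = 8

print(round(sample_std, 3), population_std)
```

For a data set as large as the one in this article, the two results are almost identical, but the difference is visible on small samples.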
describe
Most of the statistics we have seen so far (mean, std, min, and max) are reported together by the ‘describe’ function, along with the count and the quartiles. It does not report the sum. ‘describe’ is used to fetch a descriptive summary of the DataFrame.
newlikesdf.describe()
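To make the contents of the ‘describe’ output concrete, here is a small sketch on a made-up single-column DataFrame that lists the rows it returns.

```python
import pandas as pd

# Hypothetical toy data, not taken from USvideos.csv
toy = pd.DataFrame({"likes": [10, 30, 20]})

stats = toy.describe()
stat_names = list(stats.index)
print(stat_names)
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Individual statistics can be read with .loc
likes_mean = stats.loc["mean", "likes"]      # 20.0
likes_median = stats.loc["50%", "likes"]     # 20.0
```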
Summary
There are many more functions available, for example:
- min – computes the minimum, just like max
- count – counts the non-null values
- var – calculates the variance
- sem – calculates the standard error of the mean
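As a quick sketch of the last three functions in the list, the example below applies them to a made-up Series that contains one missing value, assuming the default settings (missing values skipped, ddof=1).

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry
values = pd.Series([2.0, 4.0, np.nan, 6.0])

n = values.count()       # 3, the missing value is not counted
variance = values.var()  # sample variance: (4 + 0 + 4) / (3 - 1) = 4.0
std_error = values.sem() # sample std / sqrt(n) = 2 / sqrt(3)

print(n, variance)  # 3 4.0
```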