Introduction
This article explains the ‘pipe’ function in Pandas, a very useful function through which we can chain multiple processing operations into one. In this article, we will look at
- What is the Pipe function?
- How it works
- Understanding by example
Let’s explore
pipe function
Common operations while working with datasets are handling missing values, sorting, dropping duplicates, removing unwanted data, etc. We can write each of these operations as its own function and chain all these functions together through the ‘pipe’ function.
Syntax
df_final = (df.pipe(functionOne).pipe(functionTwo).pipe(functionThree))
df_final = (df.pipe(handle_missing_values).pipe(sort_df).pipe(drop_duplicates))
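The ‘pipe’ function is available on Series, DataFrame, and GroupBy objects. It passes the object to the given function as its first argument, which allows chaining together functions that take a Series, DataFrame, or GroupBy object as a parameter.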
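As a minimal sketch of how this works, the block below shows that `obj.pipe(func, *args, **kwargs)` gives the same result as calling `func(obj, *args, **kwargs)` directly. The helper functions `handle_missing_values` and `sort_df`, their parameters, and the toy data are assumptions made up for illustration; they are not part of Pandas or of the dataset used later in this article.

```python
import pandas as pd

# Hypothetical helpers, written only for this illustration.
def handle_missing_values(data, fill_value=0):
    """Replace missing values with fill_value (works on a DataFrame or a Series)."""
    return data.fillna(fill_value)

def sort_df(dataframe, by, ascending=True):
    """Sort the rows by the given column."""
    return dataframe.sort_values(by=by, ascending=ascending)

# Made-up toy data with one missing score.
toy = pd.DataFrame({"name": ["a", "b", "c"], "score": [2.0, None, 1.0]})

# Extra positional and keyword arguments are forwarded to the function.
piped = (toy.pipe(handle_missing_values, fill_value=0)
            .pipe(sort_df, "score", ascending=False))

# The same steps written as nested calls.
nested = sort_df(handle_missing_values(toy, fill_value=0), "score", ascending=False)

print(piped.equals(nested))

# pipe is also available on a Series.
filled_scores = toy["score"].pipe(handle_missing_values, fill_value=0)
print(filled_scores.tolist())
```

The piped version reads top to bottom in the order the steps run, while the nested version has to be read from the inside out.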
Setup
In this article, we use the same dataset that I have used in my other Pandas articles: a Kaggle dataset that provides YouTube video trending statistics. URL: https://www.kaggle.com/datasnaek/youtube-new. The file we are using is ‘USvideos.csv’.
import pandas as pd
# Load the US trending videos file into a DataFrame
df = pd.read_csv('USvideos.csv')
df.columns
The columns of the dataset are
Examples of Pipe Function
The dataset has multiple columns. We will sort the DataFrame by the number of ‘likes’ in descending order, then keep only the rows with more than one million likes, then keep only the videos whose title starts with the name of a music group. After applying all these steps, we are left with the records we want.
Function One: Sort the DataFrame by 'likes' in descending order
def sortedByLikesInDescendingOrder(dataframe):
    # Use the DataFrame passed in, not the global df
    return dataframe.sort_values(by='likes', ascending=False)
sortedByLikesInDescendingOrder(df).head()
Function Two: Filter rows that have more than a million likes
def filterMoreThanMillionLikes(dataframe):
    # Keep only the rows with more than 1,000,000 likes
    return dataframe[dataframe['likes'] > 1000000]
filterMoreThanMillionLikes(df).head()
Function Three: Filter rows where the title starts with 'BTS'
def filterByTitle(dataframe):
    # Keep only the rows whose title begins with 'BTS'
    return dataframe[dataframe.title.str.startswith('BTS')]
filterByTitle(df).head()
These three functions are independent of each other. Now enter the ‘pipe’ function: we can chain all of them together to get the result.
df_final = (df.pipe(sortedByLikesInDescendingOrder)
              .pipe(filterMoreThanMillionLikes)
              .pipe(filterByTitle))
df_final.head()
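To make the chain easy to check without downloading the Kaggle file, here is a small self-contained sketch. It restates the three functions so the block runs on its own, and it uses a made-up four-row DataFrame (the titles and like counts are invented for illustration, not taken from ‘USvideos.csv’).

```python
import pandas as pd

def sortedByLikesInDescendingOrder(dataframe):
    return dataframe.sort_values(by='likes', ascending=False)

def filterMoreThanMillionLikes(dataframe):
    return dataframe[dataframe['likes'] > 1000000]

def filterByTitle(dataframe):
    return dataframe[dataframe.title.str.startswith('BTS')]

# Invented sample rows, only for illustration.
sample = pd.DataFrame({
    'title': ['BTS song A', 'Other video', 'BTS song B', 'BTS song C'],
    'likes': [1500000, 3000000, 200000, 2500000],
})

# Chained with pipe: reads in the order the steps are applied.
piped = (sample.pipe(sortedByLikesInDescendingOrder)
               .pipe(filterMoreThanMillionLikes)
               .pipe(filterByTitle))

# The same steps as nested calls: reads from the inside out.
nested = filterByTitle(filterMoreThanMillionLikes(sortedByLikesInDescendingOrder(sample)))

print(piped)
print(piped.equals(nested))
```

On this sample, only the two 'BTS' rows with more than a million likes remain, ordered from most to least liked.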
Let’s validate the results. First, check that every video title starts with the string ‘BTS’.
df_final['title']
All the rows have a title starting with 'BTS'. Next, validate the number of 'likes'.
df_final['likes']
Summary
What we have explored here is the beauty of higher-order functions. A higher-order function treats functions as values, which is exactly what we did through the ‘pipe’ function. It is built into Pandas and is usually sufficient for chaining, but if required we can write our own pipe function with whatever customized behavior we want.
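One possible shape for such a custom pipe function is sketched below. The name `custom_pipe`, its `verbose` option, the two step functions, and the toy data are all hypothetical and chosen only for illustration; the built-in ‘pipe’ does not offer this logging behavior.

```python
import pandas as pd

def custom_pipe(data, *functions, verbose=False):
    """Apply each function in order, feeding each result into the next one."""
    result = data
    for function in functions:
        result = function(result)
        if verbose:
            # Customized behavior: report the shape after every step.
            print(f"{function.__name__}: {result.shape}")
    return result

# Hypothetical step functions for the demonstration.
def keep_positive(dataframe):
    return dataframe[dataframe['value'] > 0]

def add_double(dataframe):
    return dataframe.assign(double=dataframe['value'] * 2)

# Made-up toy data.
toy = pd.DataFrame({'value': [-1, 2, 3]})

result = custom_pipe(toy, keep_positive, add_double, verbose=True)
print(result)
```

Because each step is passed in as a value, the same loop works for any sequence of functions that take a DataFrame and return one.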