Introduction
PySpark is the Python API for Apache Spark. It lets us interface with Resilient Distributed Datasets (RDDs) and supports most of Spark's features such as Spark SQL, Streaming, MLlib, and Spark Core.
This article introduces PySpark and highlights some of the important fundamentals of working with it.
Major Differences between PySpark and Pandas
There are a few important differences between PySpark and Pandas. If you have worked with Pandas but not with PySpark, it might be confusing why you would use PySpark when Pandas is available.
Pandas is great when the need is to analyze a small dataset on a single machine, but for much bigger datasets PySpark should be the choice, because Spark RDDs and DataFrames are distributed whereas a Pandas DataFrame is not; processing a huge amount of data in Pandas is much slower.
A Spark DataFrame supports parallelization, whereas Pandas doesn't, as it is a single-machine tool.
A Spark DataFrame supports fault tolerance, whereas a Pandas DataFrame doesn't.
A Spark DataFrame is immutable, whereas a Pandas DataFrame is mutable.
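As a quick illustration of how the two relate, a Pandas DataFrame can be converted into a distributed Spark DataFrame and back again. The snippet below is a minimal sketch; the data and names are made up for illustration, and it assumes a running SparkSession (creating one is covered in the Spark Session section below).
import pandas as pd
from pyspark.sql import SparkSession

# Assumes a local SparkSession; creating one is explained later in the article
spark = SparkSession.builder.master("local[*]").appName('pandas_vs_pyspark').getOrCreate()

# A small Pandas DataFrame, held in memory on a single machine
pdf = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [30, 25]})

# Convert it to a distributed, immutable Spark DataFrame
sdf = spark.createDataFrame(pdf)

# Convert back to Pandas (collects all rows to the driver, so only suitable for small results)
pdf_back = sdf.toPandas()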
Installation
While working with PySpark on a local machine, we don't have to install Spark separately; the 'pip install pyspark' command installs Spark as well.
In this article we use a Jupyter Notebook for the sample examples; the command below shows how to install PySpark.
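In a Jupyter Notebook cell the installation command looks like this:
!pip install pyspark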
Spark Session
SparkSession is the entry point to PySpark. Through a SparkSession the underlying PySpark functionality can be accessed, such as creating RDDs and DataFrames. The code below shows how to create a SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]")\
    .appName('pyspark_session')\
    .getOrCreate()
master(): if running on a cluster, we pass the cluster manager's master URL as the argument (for example 'yarn', or a 'spark://…' or 'mesos://…' URL). Since we are running locally, we use 'local', where '*' tells Spark to use all available cores on the machine.
appName(): sets the name of our application.
getOrCreate(): returns an existing SparkSession if one is already running; otherwise it creates a new one.
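For comparison, if the application were submitted to a Spark standalone cluster rather than run locally, the builder might look like the sketch below. The host and port are placeholders, not real values, so this will only work if such a cluster is actually reachable:
from pyspark.sql import SparkSession

# Hypothetical standalone cluster master URL; replace host and port with your cluster's values
spark = SparkSession.builder.master("spark://spark-master-host:7077")\
    .appName('pyspark_session')\
    .getOrCreate()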
Reading Data
A simple CSV file was created with the name 'users.csv', and we will read this file through PySpark.
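The exact file contents are not reproduced here; based on the columns used later (a name and an age, with no header row), users.csv might look something like this illustrative example:
Alice,30
Bob,25
Charlie,35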
Just like Pandas, PySpark can read files in various formats such as CSV, text, and JSON. Files are read through the spark.read interface; since we are reading a CSV file here, the spark.read.csv() method is used.
df = spark.read.csv('users.csv')
df
Evaluating df shows that we get back a Spark DataFrame; since we did not supply a schema, the columns get default names (_c0, _c1) and are read as strings. We can confirm the type of the object with
type(df)
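which returns pyspark.sql.dataframe.DataFrame, confirming that df is a Spark DataFrame and not a Pandas one.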
To see the contents of the DataFrame, we can use the show() method, which prints the first 20 rows by default:
df.show()
Since the default column names (_c0, _c1) are not meaningful, we can rename them using the withColumnRenamed() method.
df = df.withColumnRenamed('_c0', 'Name').withColumnRenamed('_c1', 'Age')
df.show()
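Note that withColumnRenamed() does not modify df in place; because Spark DataFrames are immutable, it returns a new DataFrame, which is why we reassign the result back to df. As a side note, if the CSV file had included a header row, the column names could have been picked up at read time instead; a small sketch under that assumption:
# Only applicable if users.csv actually contains a header row
df_with_header = spark.read.csv('users.csv', header=True, inferSchema=True)
df_with_header.show()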
Summary
This article covered the basic fundamentals of PySpark and how to work with it, including the differences between PySpark and Pandas. Upcoming articles will provide more in-depth insights into PySpark.