What is Colab?
Colab, or "Colaboratory", allows you to write and execute Python in your browser, with
- Zero configuration required
- Access to GPUs free of charge
- Easy sharing
Whether you're a student, a data scientist, or an AI researcher, Colab can make your work easier.
What is PySpark?
PySpark is the Python API for Apache Spark. It lets you work with Spark from Python and supports Spark's major features: Spark DataFrames, Spark SQL, Spark Streaming, Spark MLlib, and Spark Core. It provides an interactive PySpark shell for analyzing structured and semi-structured data in a distributed environment, supports reading data from multiple sources and formats, and facilitates the use of RDDs (Resilient Distributed Datasets). Under the hood, PySpark uses the Py4J library to let Python code communicate with the JVM that runs Spark.
Advantages
- Easy to use, learn, and implement
- Simple and comprehensive API
- Supports ANSI SQL
- Supports the Spark standalone, YARN, and Mesos cluster managers
- Offers more options for data visualization than Scala or Java
- RDDs and DataFrames are immutable, which makes transformations easier to reason about
- Dynamically typed, like the rest of Python
- Error handling through Python's familiar exception mechanism
Disadvantages
- It can sometimes be difficult to express a problem in the MapReduce model.
- Since Spark was originally developed in Scala, PySpark programs are relatively less efficient, in some cases roughly 10x slower than equivalent Scala programs. This can impact the performance of heavy data-processing applications.
- The Spark Streaming API in PySpark is less mature than its Scala counterpart and still needs improvement.
- PySpark cannot be used to modify Spark's internal functionality because of the abstractions it provides; Scala is preferred for such cases.
Getting Started
In this article, I am going to use Google Colab.
- Open https://colab.research.google.com/
- Sign in with your Google account.
- Select New notebook from the File menu to start a new notebook.
Here is an overview of my sample .csv data file. In the code sketches below, I assume it has columns such as `name`, `age`, `city`, and `salary`; these names are placeholders, so adjust them to match your own data.
Easy way of Installing PySpark in Colab
The easiest way to install PySpark on Google Colab is with pip.
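In a notebook cell (the leading `!` runs a shell command):

```
!pip install pyspark
```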
After installation, we can create a Spark session and check its information.
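A minimal sketch (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# Build a local Spark session, or reuse one if it already exists
spark = SparkSession.builder \
    .appName("GettingStartedPySpark") \
    .getOrCreate()

# Evaluating the session object in a notebook displays its version,
# master, and app name
spark
```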
We can also test the installation by importing a Spark library.
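For example:

```python
import pyspark
from pyspark.sql import functions as F  # a commonly used module

# If these imports succeed, the installation is working
print(pyspark.__version__)
```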
Now upload the sample data file to the Colab notebook.
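One way is Colab's upload helper (you can also drag the file into the Files pane on the left):

```python
from google.colab import files

# Opens a file picker; the chosen file lands in the notebook's
# working directory
uploaded = files.upload()
```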
Now let’s load the CSV file data.
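A sketch, assuming the uploaded file is named `sample_data.csv`:

```python
# header=True treats the first line as column names;
# inferSchema=True lets Spark guess each column's type
df = spark.read.csv("sample_data.csv", header=True, inferSchema=True)
df.show(5)
```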
Let’s do some filtering and add some new columns with some logic.
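A sketch using the assumed `age` and `salary` columns:

```python
from pyspark.sql import functions as F

df2 = (
    df.filter(F.col("age") > 21)                    # keep rows with age > 21
      .withColumn("bonus", F.col("salary") * 0.10)  # new derived column
      .withColumn("age_group",                      # conditional logic
                  F.when(F.col("age") < 30, "young").otherwise("senior"))
)
df2.show(5)
```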
Filter data
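A few equivalent filtering styles (the column names are again assumptions):

```python
from pyspark.sql import functions as F

df.filter(df.age > 25).show()                                        # attribute style
df.filter((F.col("age") > 25) & (F.col("city") == "London")).show()  # combined conditions
df.where("age > 25").show()                                          # SQL-style expression
```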
PySpark DataFrame
PySpark DataFrames organize data into two-dimensional tables of rows and columns. Each column holds the values for a specific variable, and each row holds one value from each column. Column names cannot be omitted and must be unique, each column has a single data type (such as string or numeric), and every column contains the same number of items.
Let's display the data in the DataFrame:
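```python
df.show()                   # first 20 rows in tabular form
df.show(5, truncate=False)  # first 5 rows, without truncating long values
```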
You can inspect the column data types with these commands:
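```python
df.printSchema()  # tree view of column names, types, and nullability
print(df.dtypes)  # list of (column name, type) tuples
```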
Some other useful commands are first, describe, and count:
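```python
df.first()            # the first Row
df.describe().show()  # count, mean, stddev, min, and max for numeric columns
df.count()            # total number of rows
```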
Handle duplicate and null values:
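A sketch (the column names in dropDuplicates and fillna are assumptions):

```python
df_deduped = df.dropDuplicates()          # remove exact duplicate rows
df_deduped = df.dropDuplicates(["name"])  # or deduplicate on specific columns
df_filled = df.fillna({"age": 0, "city": "unknown"})  # replace nulls per column
```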
If there are any null values, you can drop the entire row from the result:
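```python
# Drop every row that contains at least one null value
df_no_nulls = df.dropna()
df_no_nulls.show(5)
```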
Another way:
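```python
# Equivalent, via the DataFrame's na accessor
df_no_nulls = df.na.drop(how="any")
```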
Let's play around with selecting data:
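Again with the assumed column names:

```python
from pyspark.sql import functions as F

df.select("name", "city").show()  # pick specific columns
df.select(F.col("name"),
          (F.col("salary") * 1.1).alias("raised_salary")).show()  # expressions work too
```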
Rename a column:
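```python
# Rename the assumed "city" column to "location"; returns a new DataFrame
df_renamed = df.withColumnRenamed("city", "location")
df_renamed.printSchema()
```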
Conclusion
In this article, we learned how to set up Colab, install PySpark, and run some basic data commands.