Analyse Data With Spark Pool In Azure Synapse Analytics – Part Two

This article is a continuation of Part 1, which I posted earlier. I strongly recommend going through Part 1 before reading this one. In this demo, we will use the serverless Apache Spark pool model: we will load a sample Parquet data file into a Spark database (yes, we will create the database as well) and then analyze it using Spark from a Synapse notebook.

The sample Parquet file used in this demo can be downloaded from the link below.

https://azuresynapsestorage.blob.core.windows.net/sampledata/NYCTaxiSmall/NYCTripSmall.parquet

To start with, make sure you have uploaded the sample file into your ADLS storage. You can refer to my previous articles if you need help with creating a workspace, attaching a Data Lake storage account to it, or uploading a file. For this demo, I have uploaded the sample NYCTripSmall.parquet file into a container in my ADLS storage account.

The next step is to create a serverless Spark pool. I will not go through that here, as I have already covered it in my previous articles.

Analyze sample data with the Spark pool

As a first step, we are going to load the sample data file from storage into a Spark dataframe using PySpark code. The scripts are executed in the cells of a notebook, which supports several languages such as PySpark, .NET (C#), Scala, and so on for running your queries.

Go to the Develop tab and select a new notebook. This is where we will see the cells mentioned earlier and execute the code.

 

 

You cannot run the query without attaching a Spark pool. The available Spark pools are listed in a dropdown for you to select.

I am using the following code to load the sample file into a dataframe and display the first 25 rows. You can get the path from the sample file's Properties → ABFSS Path (Azure Blob File System path); substitute your own container and storage account names in the path below.

df = spark.read.load(path='abfss://<container>@<storageaccount>.dfs.core.windows.net/NYCTripSmall.parquet', format='parquet')

display(df.limit(25))

Hit the run button on the cell to execute it. Please note that it will take some time to see the results on the first execution, since a new Spark session has to be started; subsequent runs will be faster.

And as usual, Azure Synapse Studio complements the results with a built-in chart feature that you can play with.

What if we could load the newly created dataframe into a database, into a table inside that database? Let's see how we can do that.

I run another two-line PySpark snippet, which creates a database and loads the dataframe we created in the previous step into a table inside that database.
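For reference, the cell looks something like the sketch below. The database name nyctaxi and the table name trip are just the names I assume for this demo; replace them with whatever suits your workspace.

# Create a Spark database and save the dataframe from the previous step as a table inside it
spark.sql("CREATE DATABASE IF NOT EXISTS nyctaxi")
df.write.mode("overwrite").saveAsTable("nyctaxi.trip")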

The query has been executed successfully. Now go to the Data tab on the left side and refresh the Workspace section to see the newly created database and table.

Analyzing the data

Now let's analyze the stored data using PySpark. I am going to pull down all the data in the table and save it into a new dataframe.
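A minimal sketch of that cell, assuming the nyctaxi.trip table created above:

# Read the whole table back into a new dataframe (assumes the nyctaxi.trip table created earlier)
new_df = spark.sql("SELECT * FROM nyctaxi.trip")
display(new_df)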

Let's do some aggregations on our newly created table and save the results as a new table called passengercountstats.
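A rough sketch of such a cell is shown below. The passenger_count and trip_distance column names are assumptions based on the sample dataset, so adjust them to match your data.

# Aggregate trip distance per passenger count (column names assumed from the sample dataset)
agg_df = spark.sql("""
    SELECT passenger_count,
           SUM(trip_distance) AS SumTripDistance,
           AVG(trip_distance) AS AvgTripDistance
    FROM nyctaxi.trip
    WHERE trip_distance > 0 AND passenger_count > 0
    GROUP BY passenger_count
    ORDER BY passenger_count
""")
display(agg_df)

# Save the aggregated results as a new table
agg_df.write.mode("overwrite").saveAsTable("nyctaxi.passengercountstats")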

As mentioned earlier, the results can be visualized with the built-in chart types.

Summary

In this two-part article, we saw what Spark is and what its important concepts are, how to load a Parquet file into ADLS and query it using a serverless Spark pool through Synapse Studio, and how to create a Spark database and load tables into it from a dataframe. We also saw how to visualize the table results as a bar chart or pie chart without any add-ons or external applications.

References

Microsoft official documentation.