Create Synapse Notebook And Run Python And SQL Under Spark Pool

In today's article we will look into how we could run both Python and SparkSQL queries in a single notebook workspace under the built-in Apache Spark Pools to transform the data in a single window.

Introduction

In Azure synapse analytics, a notebook is where you can write live code, visualize and also comment text on them. It lets you test and get the output of each cells rather than executing the entire script. The initial setup is very easy and you can leverage the built-in security features which helps your data stay secure. The notebooks can process across multiple data formats like RAW(CSV, txt JSON), Processed(parquet, delta lake, orc), and SQL(tabular data files against spark & SQL) formats. Apart from all the above benefits the built-in data visualization feature saves a lot of time and comes handy when dealing with subsets of data.

Notebooks can support multiple languages in different cells of a single notebook by specifying the magic commands at the start of the cell.

Magic command Language Description
%%pyspark Python Execute a Python query against Spark Context.
%%spark Scala Execute a Scala query against Spark Context.
%%sql SparkSQL Execute a SparkSQL query against Spark Context.
%%csharp .NET for Spark C# Execute a .NET for Spark C# query against Spark Context.

Steps

This demo will be run on spark pool which has to be created first. Verify if the spark pool is already available in case you have done some labs earlier or create a new one. The spark pool is similar to cluster that we create to run the queries, here in this demo ‘synsparkpool’ is the apache spark pool we are going to use for running the queries.

Go to the development tab from the left side and create a new notebook as below.

Once created you can enter and query results block by block as you would do in Jupyter for python queries. Make sure the newly created notebook is attached to the spark pool which we created in the first step. You will also have an option to change the query language between pyspark, scala, c# and sparksql from the Language dropdown option.

From the first cell let’s try to create a PySpark data frame and display the results. I am going to generate some dummy data for this demo. Synapse notebooks comes with Intellisense loaded which autocompletes the keywords when you are typing the first letter in the cells.

Click the play button on top left to execute the queries in the cell. Please note that it might take some time to start the session when executed the first time, it took me 4.22 Mins to complete this but it is normal.

Now as the dataframe is created let's save them into temp table as you cannot reference data or variables directly across different languages.

See the above image where we are querying the dataframe using SQL query in the same session with another cell which proves that it is not restricted to only Python or Scala only.

Not only with few lines of sample data, most of the times you will have to import a complete CSV or Excel data from storage location directly onto the dataframe which can later be queried through SQL. In the next step, we will demo how to import a huge amount of data.

I am going to load the ‘Financial_Sample_Jan2021_MINI.csv’ file saved in my ADLS Gen2 storage into a spark dataframe. For this, I am copying the abfss path from file properties

Create Synapse Notebook and run Python and SQL under Spark Pool

Using the below spark query I have read the CSV file data into the dataframe now.

dataframe_2 = spark.read.csv("abfss://[email protected]/Financial_Sample_Jan2021_MINI.csv",header=True)
dataframe_2.show()

The same dataframe can be used to create a temp table or view and then queried through SQL. Please note that if you try to query the dataframe directly from SQL you will get the following error.

Summary

This is a basic and simple exercise to show how synapse notebook helps query multiple languages in a single window. One can use python to load/transform and SQL language query to query the same data from temp table.

Reference

Microsoft official documentation