Introduction
Analyzing data often requires grouping and summarizing it to uncover trends and patterns. In this article, we’ll look at two powerful functions, ROLLUP and CUBE, in Microsoft Fabric’s Spark environment and show how they can be used to explore the NYC Taxi dataset. We’ll walk you through simple PySpark examples and explain when to use each function based on your needs.
Microsoft Fabric makes working with big data using Spark straightforward. With functions like ROLLUP and CUBE, you can quickly summarize key metrics, turning complex datasets into insights that drive better decisions.
Obtaining the NYC Taxi data
To follow along, you'll need the NYC Taxi dataset, which is available in Azure Open Datasets. This dataset provides comprehensive information about taxi trips in New York City, including pickup dates, payment types, and fare amounts. Using Azure Open Datasets makes accessing and preparing the data quick and seamless within Microsoft Fabric.
Step 1. Access the NYC Taxi dataset and filter it to the 2018 Q4 trips. We are going to save the filtered data in the 'nycTripYellow' table.
from pyspark.sql.functions import *

# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "yellow"
blob_sas_token = r""

# Allow Spark to read from the blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (
    blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
    'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
    blob_sas_token
)
print('Remote blob path: ' + wasbs_path)

# Read the Parquet files; Spark is lazy, so no data is loaded yet
srcDf = spark.read.parquet(wasbs_path)

# Keep only the 2018 Q4 trips
taxiDf = srcDf.filter((col('puYear') == 2018) & (col('puMonth').between(10, 12)))
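To persist the filtered data as the 'nycTripYellow' table mentioned above, you can write it to the Lakehouse. This is a minimal sketch, assuming a default Lakehouse is attached to the notebook:

# Save the Q4 2018 trips as a managed Lakehouse table named 'nycTripYellow'
taxiDf.write.mode("overwrite").saveAsTable("nycTripYellow")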
Step 2. Create the payment type reference table 'nycTripPaymentType'. This lookup table maps each payment type ID to its description.
ptData = [
    (0, "Flex Fare trip"),
    (1, "Credit card"),
    (2, "Cash"),
    (3, "No charge"),
    (4, "Dispute"),
    (5, "Unknown"),
    (6, "Voided trip")
]
ptCols = ["PaymentTypeID", "Description"]
paymentTypeDf = spark.createDataFrame(ptData, schema=ptCols)
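createDataFrame infers the column types here. If you prefer explicit types, a sketch using StructType (column names unchanged, types assumed):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# An explicit schema avoids relying on type inference for the lookup
ptSchema = StructType([
    StructField("PaymentTypeID", IntegerType(), False),
    StructField("Description", StringType(), False)
])
paymentTypeDf = spark.createDataFrame(ptData, schema=ptSchema)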
Step 3. Join the two datasets together.
taxiJoinDf = taxiDf.join(
    paymentTypeDf, taxiDf["paymentType"] == paymentTypeDf["PaymentTypeID"], how="inner"
) \
    .select("Description", "puYear", "puMonth", "fareAmount") \
    .withColumnRenamed("Description", "paymentTypeDesc")
display(taxiJoinDf.limit(10))
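Because the payment type lookup has only seven rows, you can also hint Spark to broadcast it and avoid shuffling the large taxi DataFrame. A sketch using the standard broadcast hint:

from pyspark.sql.functions import broadcast

# Broadcasting the tiny lookup keeps the join shuffle-free on the taxi side
taxiJoinDf = taxiDf.join(
    broadcast(paymentTypeDf),
    taxiDf["paymentType"] == paymentTypeDf["PaymentTypeID"],
    how="inner"
) \
    .select("Description", "puYear", "puMonth", "fareAmount") \
    .withColumnRenamed("Description", "paymentTypeDesc")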
Now, we have the data ready for ROLLUP and CUBE functions.
Using ROLLUP to summarize data
ROLLUP is a smarter version of GROUP BY. While GROUP BY gives us aggregations based on the specified columns, ROLLUP goes one step further by adding a sub-total at each level of the specified column hierarchy and a grand total for the overall result.
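Under the hood, rollup("puMonth", "paymentTypeDesc") is shorthand for three grouping sets: the full column pair, puMonth alone, and the empty set (the grand total). As a minimal sketch, here is the equivalent Spark SQL, assuming we register taxiJoinDf as a temporary view (view name hypothetical):

# Register the joined DataFrame as a temporary view for Spark SQL
taxiJoinDf.createOrReplaceTempView("taxiTrips")

# ROLLUP(puMonth, paymentTypeDesc) expands to these grouping sets
rollupSqlDf = spark.sql("""
    SELECT puMonth, paymentTypeDesc, SUM(fareAmount) AS totalRevenue
    FROM taxiTrips
    GROUP BY GROUPING SETS ((puMonth, paymentTypeDesc), (puMonth), ())
""")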
Imagine you're a store owner analyzing your monthly revenue.
- You want to know the revenue for each payment type (Cash, Credit card) for each month.
- You also want the total revenue for each month.
- Then, you want the grand total for Q4.
In the query below, we use puMonth and paymentTypeDesc to analyze the monthly revenue with the ROLLUP function. To override the default number formatting, we use format_number to display the result with two decimal places.
rollupDf = taxiJoinDf.rollup("puMonth", "paymentTypeDesc") \
    .agg(sum("fareAmount").alias("totalRevenue"))
rollupFormatDf = rollupDf.withColumn("totalRevenue", format_number("totalRevenue", 2))
display(rollupFormatDf.orderBy(asc_nulls_last("puMonth"), asc_nulls_last("paymentTypeDesc")))
In the result, we can see the revenue per payment type for each month, followed by the sub-total for the month and the Q4 grand total. The NULL values represent the totals generated by the ROLLUP query. To keep the output organized, sort the result with the asc_nulls_last function so the NULL (total) rows appear after the detail rows they summarize.
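One caveat: if a grouped column can contain genuine NULLs, those rows are indistinguishable from the generated totals in the output. The grouping function from pyspark.sql.functions tells them apart. A minimal sketch:

# grouping(col) is 1 when the row is a sub-total/grand total for that column, 0 otherwise
labeledDf = taxiJoinDf.rollup("puMonth", "paymentTypeDesc") \
    .agg(
        sum("fareAmount").alias("totalRevenue"),
        grouping("puMonth").alias("isMonthTotal"),
        grouping("paymentTypeDesc").alias("isPaymentTotal")
    )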
Using CUBE to analyze data from all angles
CUBE is a more comprehensive version of ROLLUP. Instead of following a single column hierarchy, it produces all possible combinations of the specified columns. Like ROLLUP, it also produces sub-totals and a grand total. CUBE is perfect for cases where you need to explore your data from all angles.
Imagine you want everything ROLLUP gives you, plus the revenue generated by each payment type across all the months; that is when the CUBE function should be used.
In this query, we are using the same columns to analyze the revenue. The code looks almost identical, except we are using the CUBE function instead.
cubeDf = taxiJoinDf.cube("puMonth", "paymentTypeDesc") \
    .agg(sum("fareAmount").alias("totalRevenue"))
cubeFormatDf = cubeDf.withColumn("totalRevenue", format_number("totalRevenue", 2))
display(cubeFormatDf.orderBy(asc_nulls_last("puMonth"), asc_nulls_last("paymentTypeDesc")))
The result shows the new rows the CUBE function created: the total revenue generated by each payment type across all of Q4. This gives us every possible grouping to help complete the analysis.
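To pull out just those extra rows, filter for rows where puMonth is NULL but paymentTypeDesc is not. A sketch, assuming the data has no genuine NULL months (otherwise use the grouping function shown earlier):

# Per-payment-type totals across all of Q4 (the rows CUBE adds beyond ROLLUP)
paymentTotalsDf = cubeDf.filter(col("puMonth").isNull() & col("paymentTypeDesc").isNotNull())
display(paymentTotalsDf.orderBy(desc("totalRevenue")))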
Summary
In this article, we explored the ROLLUP and CUBE functions, two powerful aggregation techniques in Spark. They can be used in Microsoft Fabric Notebook to analyze your data.
Key Differences Between CUBE and ROLLUP
| Feature | ROLLUP | CUBE |
| --- | --- | --- |
| Purpose | Creates sub-totals and a grand total based on the specified columns (in order) | Creates all possible combinations of the columns |
| Best Use Case | Reports and dashboards with a structured, well-defined order | Exploratory and multi-dimensional views |
| Output | Sub-totals for the defined hierarchy | Sub-totals for all combinations |
Finally, both ROLLUP and CUBE are expensive operations. To improve performance, keep the number of rows and columns to a minimum by applying the filter and select functions before aggregating. In our dataset, we filtered the data to 2018 Q4 and selected only the puYear, puMonth, paymentTypeDesc, and fareAmount columns.
Happy Learning!