Introduction
In this article, we will learn the difference between narrow and wide transformations in PySpark. Transformations are operations applied to a Resilient Distributed Dataset (RDD) or DataFrame to produce a new RDD or DataFrame. They are classified as narrow or wide depending on how the data is partitioned and moved across the cluster.
Narrow Transformations
Narrow transformations are operations that can be computed from a single partition of the input data. In other words, each partition of the parent RDD/DataFrame can be transformed independently, without requiring data from other partitions, so no shuffling or data exchange between partitions is needed. Examples of narrow transformations include map, filter, flatMap, and union.
Example of a narrow transformation
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("NarrowTransformation").getOrCreate()
# Create a DataFrame
data = [("Loki", 25), ("Ravi", 30), ("Aditya", 35)]
df = spark.createDataFrame(data, ["name", "age"])
# Apply a narrow transformation (map): add 5 years to every person's age
mapped_df = df.rdd.map(lambda row: (row.name, row.age + 5)).toDF(["name", "new_age"])
mapped_df.show()
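Narrow transformations can also be expressed directly on the DataFrame. As a minimal sketch (reusing the df created above), filter and withColumn are narrow as well, and because no shuffle occurs, the number of partitions is preserved:
from pyspark.sql import functions as F
# filter and withColumn are narrow: each output partition is computed
# from exactly one input partition, so no shuffle is triggered.
filtered_df = df.filter(F.col("age") > 26).withColumn("new_age", F.col("age") + 5)
# Narrow transformations preserve the partition count.
print(df.rdd.getNumPartitions())           # partition count of the input
print(filtered_df.rdd.getNumPartitions())  # same value - nothing was moved
filtered_df.show()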
Wide Transformations
Wide transformations are operations that require data from multiple partitions of the input to be combined or shuffled across the cluster. Because they involve data movement over the network, they are generally more expensive than narrow transformations. Examples of wide transformations include groupByKey, reduceByKey, join, distinct, and repartition. (Note that coalesce is usually narrow: by default it merges partitions without a shuffle.)
Example of a wide transformation
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("WideTransformation").getOrCreate()
# Create an RDD
data = [("Loki", 1), ("Ravi", 1), ("Aditya", 2), ("Raju", 3), ("Sumit", 4)]
rdd = spark.sparkContext.parallelize(data)
# Apply a wide transformation (groupByKey)
grouped_rdd = rdd.groupByKey().mapValues(list)
# Collect the result
result = grouped_rdd.collect()
# Print the result
for key, values in result:
    print(f"{key}: {values}")
In the example above, groupByKey requires records from different partitions to be shuffled and combined by key. This data movement across the cluster is what makes it a wide transformation.
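You can confirm that a shuffle is planned by inspecting the physical plan. As a small sketch (assuming the same SparkSession and data list as above), a DataFrame groupBy/agg is also wide, and explain() shows an Exchange operator, which marks the shuffle boundary:
# groupBy followed by an aggregation is a wide transformation on DataFrames.
df2 = spark.createDataFrame(data, ["name", "score"])
agg_df = df2.groupBy("name").sum("score")
# The physical plan printed below contains an "Exchange" (shuffle) operator.
agg_df.explain()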
Summary
It's generally recommended to use narrow transformations whenever possible, as they are more efficient and avoid unnecessary data movement. Wide transformations should be used judiciously, as the shuffle they trigger can be expensive and hurt performance, especially on large datasets.
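When a wide transformation is unavoidable, prefer operations that reduce the data before the shuffle. As a sketch (reusing the rdd from the wide-transformation example), reduceByKey combines values within each partition before shuffling, so far less data crosses the network than with groupByKey:
# reduceByKey performs a map-side combine before the shuffle,
# so each partition sends only one partial sum per key.
summed_rdd = rdd.reduceByKey(lambda a, b: a + b)
print(summed_rdd.collect())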