UNION In PySpark SQL

This post explains how you can effectively union two tables or data frames in Databricks.

There are different methods to handle a union, and this post explains how you can leverage native Spark statements to get the expected result.

When we start thinking about a union, we need to make sure the column names in both tables are the same so that we get the expected result. But what happens if the columns are different, or in a different order?

Let us first see what a plain UNION in spark.sql gives us.

I have two simple data frames defined for this demo.

df1 = spark.createDataFrame([[1, "John", "$25000"]], ["ID", "Name", "Sales"])
df2 = spark.createDataFrame([[2, "Bob", "Marketing"]], ["ID", "Name", "Department"])

Let us run spark.sql with a union over the above two data frames.

I am creating views out of the two data frames so that we can use SQL syntax in the union statement.

df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("select * from df1 union select * from df2").show()

Check the result below (row order may vary):

+---+----+---------+
| ID|Name|    Sales|
+---+----+---------+
|  1|John|   $25000|
|  2| Bob|Marketing|
+---+----+---------+

If you look at the result, Spark is simply unioning both datasets by position, so Bob's "Marketing" value lands under the Sales column. And as you may know, if we try to union inputs with a different number of columns, Spark throws an error saying the union can only be performed on tables with the same number of columns.


To achieve the correct result with plain SQL, we would need to place the columns in the correct order and add the missing columns with null values.


Let us come to the point: I am going to show how you can achieve the correct result easily.

In Spark, the DataFrame API has a function called unionByName, which resolves columns by name and helps us get the expected result.

Let us go with the same sample and use the unionByName function. Here, we do not need to create views or play with SQL.

df1.unionByName(df2).show()

The above call fails because, by default, unionByName expects the same set of column names in both data frames. Let us add the parameter allowMissingColumns=True and check.

df1.unionByName(df2, allowMissingColumns=True).show()

This time the union succeeds: both rows come through with the full set of columns (ID, Name, Sales, Department), and null fills in the values each side is missing.

We got the expected result in one step. The other beauty is that it shows the correct results even when the column names are in a different order.


In terms of the result, this is similar to UNION ALL: if the same row appears in both datasets, both copies are kept. If you need distinct results, chain distinct() onto the function like below,

df1.unionByName(df2, allowMissingColumns=True).distinct().show()


Let me know if you are doing this in a different way in the comment below.

Happy Learning!!