This post explains how you can effectively union two tables or data frames in databricks.
There are different methods to handle the union and this post explains how you can leverage the native spark statement to get the expected result.
When we start thinking about union, we need to make sure the column names on both tables are the same so that we will get the expected result. What will happen if the columns are different and in a different order?
Let us see the simple result using UNION in spark.sql.
I have a simple data frame defined for this demo.
df1 = spark.createDataFrame([[1, "John", "$25000"]], ["ID", "Name", "Sales"])
df2 = spark.createDataFrame([[2, "Bob", "Marketing"]], ["ID", "Name", "Department"])
Let us see the spark.sql with the union for the above 2 data frames.
I am creating a view out of the above 2 data frames to use the SQL syntax in the union statement.
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("select * from df1 union select * from df2").show()
Check the result below,
If you see the result, it is simply doing the union of both datasets. As you know, if we try with a different number of columns then it will through an error like below,
To achieve the correct result, we need to place the column incorrect order and add missing columns with null values.
Let us come to the point, I am going to show how you can achieve the correct result easily.
In spark, we have a function called “unionByName”. This will help to get the expected result.
Let us go with the same sample and start to use the unionByName function. Here, we do not need to create a view and play with SQL.
The above syntax failed because it was looking for the same column names. Let us add a property called “allowMissingColumns=True” and check.
df1.unionByName(df2,allowMissingColumns=True).show()
The above syntax failed because it was looking for the same column names. Let us add a property called “allowMissingColumns=True” and check.
We got the expected result in one step. The other beauty is it will show the correct results even the column name are in a different order.
This is something similar to UNION ALL in terms of the result. If you have matching values in both the dataset then it will show both the values. If you need a distinct of the result then use distinct() with the function as like below,
df1.unionByName(df2,allowMissingColumns=True).distinct().show()
Let me know if you are doing this in a different way in the comment below.
Happy Learning!!