UDFs (User Defined Functions) are reusable pieces of code that we can call from our notebook, similar to user-defined functions in a typical RDBMS.
In Spark, we write a function in Python or Scala, then either wrap it with udf() or register it as a UDF, and use it on DataFrames and in SQL.
Use Case of a UDF
Suppose we want to convert the first letter of every word in a name string to upper case. PySpark's built-in features don't provide this function, so we can create a UDF and reuse it wherever it is needed. A UDF is created once and can be reused across several DataFrames and SQL expressions.
Before we create any UDF, we should check whether a similar function is already available among the built-in Spark SQL functions. PySpark SQL provides many predefined common functions, and more are added with every release, so it is best to check before reinventing the wheel.
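For example, one quick way to scan the pyspark.sql.functions module for a function name containing a keyword (the keyword below is just an illustrative search term) is:

# List built-in functions whose name contains a keyword before writing a UDF
from pyspark.sql import functions as F

keyword = "upper"  # illustrative search term
print([name for name in dir(F) if keyword in name.lower()])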
When creating UDFs, we need to design them carefully; otherwise we will run into optimization and performance issues.
Steps to create a UDF
To create a UDF, we first need to write the function in our desired programming language. Here, I am writing the function in Python.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def convertCase(str):
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        # Upper-case the first letter of each word and keep the rest unchanged
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr
The above function takes a string as input and converts the first letter of every word to upper case. Next, we wrap the function with udf() so it can be used on a DataFrame:
convertUDF = udf(lambda z: convertCase(z), StringType())
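For illustration, assume a small DataFrame with the ID and Name columns used below; the SparkSession setup and sample rows here are just an example, not data from the original article:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UDF Example").getOrCreate()

# Hypothetical sample data with the ID and Name columns used below
data = [("1", "john jones"), ("2", "tracey smith"), ("3", "amy sanders")]
df = spark.createDataFrame(data, schema=["ID", "Name"])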
Now we can use convertUDF on our DataFrame to convert the first letter of each word in the Name column to upper case.
df.select(col("ID"), \
          convertUDF(col("Name")).alias("Name")) \
  .show(truncate=False)
Registering a PySpark UDF & Using It in SQL
To use the convertCase() function in PySpark SQL, we need to register it with PySpark by using spark.udf.register():
spark.udf.register("convertUDF", convertCase, StringType())
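Once registered, the UDF can be used directly in a SQL query. As a sketch, assuming the sample DataFrame from earlier is exposed as a temporary view (the view name NAME_TABLE is just an example):

# Expose the DataFrame to SQL under a hypothetical view name
df.createOrReplaceTempView("NAME_TABLE")

spark.sql("SELECT ID, convertUDF(Name) AS Name FROM NAME_TABLE").show(truncate=False)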
Things to Note When Creating a UDF
Execution order
Spark does not guarantee that expressions are evaluated left to right or in any other fixed order; PySpark reorders execution for query optimization and planning. Hence, the sub-expressions in AND, OR, WHERE and HAVING clauses cannot be relied on to evaluate in a particular order, and a guard such as a null check may run after the UDF it is meant to protect.
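As an illustration, using the hypothetical NAME_TABLE view from above, the null filter in the WHERE clause below is not guaranteed to run before convertUDF is applied:

# The WHERE filter is NOT guaranteed to evaluate before convertUDF,
# so the UDF may still receive null values and fail.
spark.sql("SELECT convertUDF(Name) AS Name FROM NAME_TABLE WHERE Name IS NOT NULL") \
     .show(truncate=False)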
Handling Null Checks
UDFs are error-prone when not designed carefully. For example, if the Name column contains null on some records, convertCase() will call split() on None and throw an AttributeError.
It is always best practice to check for null inside the UDF itself rather than relying on a null check outside the UDF.
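As a minimal sketch of one way to do this, the function can simply return None (or a default value) for null input:

def convertCase(str):
    # Guard against null input instead of failing with an AttributeError
    if str is None:
        return None
    resStr = ""
    for x in str.split(" "):
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr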
UDFs are a black box to PySpark, so it cannot apply its optimizations to them and we lose the optimization PySpark performs on DataFrame/Dataset operations. When possible, we should use Spark SQL built-in functions, as these benefit from the optimizer. Consider creating a UDF only when the functionality you need is not available as an existing built-in SQL function.
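For example, a simple transformation such as upper-casing a whole column is better expressed with the built-in upper() function than with an equivalent UDF (the column names follow the earlier example):

from pyspark.sql.functions import upper

# The built-in upper() is visible to the optimizer, unlike a UDF
df.select(col("ID"), upper(col("Name")).alias("Name")).show(truncate=False)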