UDFs (User Defined Functions) are reusable pieces of code that we can call from our notebook, similar to user-defined functions in a typical RDBMS.
In Spark, we write a function in Python or Scala, then either wrap it with udf() or register it as a UDF, and use it on DataFrames and in SQL.
Use Case of a UDF
Suppose we want to convert the first letter of every word in a name string to upper case. PySpark's built-in features don't provide this function, so we can create a UDF and reuse it wherever it is needed. A UDF is created once and can be reused across several DataFrames and SQL expressions.
Before we create any UDF, we should check whether a similar function is already available among the built-in Spark SQL functions. PySpark SQL provides many predefined common functions, and more are added with every release, so it is best to check before reinventing the wheel.
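For example, one quick way to scan the pyspark.sql.functions module for a function name containing a keyword (the keyword below is just an illustrative search term) is:

# List built-in functions whose name contains a keyword before writing a UDF
from pyspark.sql import functions as F

keyword = "upper"  # illustrative search term
print([name for name in dir(F) if keyword in name.lower()])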
When creating UDFs, we need to design them carefully; otherwise we will run into optimization and performance issues.
Steps to create a UDF
To create a UDF, we first need to write the function in our desired programming language. Here, I am writing the function in Python.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def convertCase(str):
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        # Upper-case the first letter of each word and keep the rest unchanged
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr
The above function takes a string as input and converts the first letter of every word to upper case. Next, we wrap the function with udf() so it can be used on a DataFrame:
convertUDF = udf(lambda z: convertCase(z), StringType())
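For illustration, assume a small DataFrame with the ID and Name columns used below; the SparkSession setup and sample rows here are just an example, not data from the original article:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UDF Example").getOrCreate()

# Hypothetical sample data with the ID and Name columns used below
data = [("1", "john jones"), ("2", "tracey smith"), ("3", "amy sanders")]
df = spark.createDataFrame(data, schema=["ID", "Name"])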
Now we can use convertUDF on our DataFrame to convert the first letter of each word in the Name column to upper case.
df.select(col("ID"), \
          convertUDF(col("Name")).alias("Name")) \
  .show(truncate=False)
Registering a PySpark UDF & Using It in SQL
To use the convertCase() function in PySpark SQL, we need to register it with PySpark by using spark.udf.register():
spark.udf.register("convertUDF", convertCase, StringType())
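Once registered, the UDF can be used directly in a SQL query. As a sketch, assuming the sample DataFrame from earlier is exposed as a temporary view (the view name NAME_TABLE is just an example):

# Expose the DataFrame to SQL under a hypothetical view name
df.createOrReplaceTempView("NAME_TABLE")

spark.sql("SELECT ID, convertUDF(Name) AS Name FROM NAME_TABLE").show(truncate=False)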
Things to Note When Creating a UDF
Execution order
Spark does not guarantee that expressions are evaluated left to right or in any other fixed order; PySpark reorders execution for query optimization and planning. Hence, the sub-expressions in AND, OR, WHERE and HAVING clauses cannot be relied on to evaluate in a particular order, and a guard such as a null check may run after the UDF it is meant to protect.
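As an illustration, using the hypothetical NAME_TABLE view from above, the null filter in the WHERE clause below is not guaranteed to run before convertUDF is applied:

# The WHERE filter is NOT guaranteed to evaluate before convertUDF,
# so the UDF may still receive null values and fail.
spark.sql("SELECT convertUDF(Name) AS Name FROM NAME_TABLE WHERE Name IS NOT NULL") \
     .show(truncate=False)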
Handling Null Checks
UDFs are error-prone when not designed carefully. For example, if the Name column contains null on some records, convertCase() will call split() on None and throw an AttributeError.
It is always best practice to check for null inside the UDF itself rather than relying on a null check outside the UDF.
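As a minimal sketch of one way to do this, the function can simply return None (or a default value) for null input:

def convertCase(str):
    # Guard against null input instead of failing with an AttributeError
    if str is None:
        return None
    resStr = ""
    for x in str.split(" "):
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr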
UDFs are a black box to PySpark, so it cannot apply its optimizations to them and we lose the optimization PySpark performs on DataFrame/Dataset operations. When possible, we should use Spark SQL built-in functions, as these benefit from the optimizer. Consider creating a UDF only when the functionality you need is not available as an existing built-in SQL function.
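For example, a simple transformation such as upper-casing a whole column is better expressed with the built-in upper() function than with an equivalent UDF (the column names follow the earlier example):

from pyspark.sql.functions import upper

# The built-in upper() is visible to the optimizer, unlike a UDF
df.select(col("ID"), upper(col("Name")).alias("Name")).show(truncate=False)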