PySpark and UDF types problem

Published 11-02-2020 12:20:05


Here is a quick note on something that might not be obvious: be careful with UDF return types in PySpark.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, FloatType

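def very_fun(idk):
    # Returns a Python int (the value matches the result shown further down)
    return 22

def floating_fun(idk):
    # Returns a Python float
    return 22.0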

df = sqlContext.createDataFrame(
    [
        (1, 'foo'),
        (2, 'bar'),
    ],
    ['id', 'txt']
)
funfun_int = udf(very_fun, IntegerType())
funfun_float = udf(very_fun, FloatType())
floatingfun_int = udf(floating_fun, IntegerType())
floatingfun_float = udf(floating_fun, FloatType())

df = df.withColumn('funfun_int', funfun_int(df['id']))
df = df.withColumn('funfun_float', funfun_float(df['id']))

df = df.withColumn('floatingfun_int', floatingfun_int(df['id']))
df = df.withColumn('floatingfun_float', floatingfun_float(df['id']))

And the result is not very amusing:

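+---+---+----------+------------+---------------+-----------------+
| id|txt|funfun_int|funfun_float|floatingfun_int|floatingfun_float|
+---+---+----------+------------+---------------+-----------------+
|  1|foo|        22|        null|           null|             22.0|
|  2|bar|        22|        null|           null|             22.0|
+---+---+----------+------------+---------------+-----------------+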

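Conclusion: know your types. A PySpark UDF is not going to do a cast for you. When the Python value returned by the function does not match the declared return type (an int for FloatType, or a float for IntegerType), you get null instead of an error.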
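One way to avoid the nulls is to do the cast yourself in Python before the value goes back to Spark. The snippet below is only a minimal sketch of that idea: `returning` and `as_float_udf` are hypothetical helper names I made up for illustration, not part of PySpark, and the PySpark import is kept inside the function so the pure-Python part can be tried on its own.

```python
import functools


def returning(py_type):
    """Hypothetical helper: coerce a function's result to py_type (None passes through)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            result = fn(*args)
            return None if result is None else py_type(result)
        return wrapper
    return decorator


def very_fun(idk):
    return 22  # a Python int, as in the example above


# Same function, but the result is now always a Python float
very_fun_as_float = returning(float)(very_fun)


def as_float_udf(fn):
    """Hypothetical helper: wrap fn so its Python type agrees with FloatType."""
    # Imported here so the rest of the sketch does not require PySpark
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType
    return udf(returning(float)(fn), FloatType())
```

With that, something like `df.withColumn('funfun_float', as_float_udf(very_fun)(df['id']))` should give a float column instead of null. Another option, if you prefer to keep the Python function untouched, is to declare the UDF with the type the function really returns and cast the resulting column on the Spark side with `.cast(FloatType())`.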

Note: I haven’t tested pandas UDFs in this case, but I suppose they are going to be a bit more creative.