PySpark and UDF types problem

Published 11-02-2020 12:20:05

Hello!

Here is a quick note about something that might not be obvious: be careful with UDF return types in PySpark.

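from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, FloatType

# Reuse the active session if there is one (e.g. in the pyspark shell).
spark = SparkSession.builder.getOrCreate()


def very_fun(idk):
    # Always returns a Python int.
    return 22


def floating_fun(idk):
    # Always returns a Python float.
    return 22.0


df = spark.createDataFrame(
    [
        (1, 'foo'),
        (2, 'bar'),
    ],
    ['id', 'txt'],
)

# Function returning an int, declared as IntegerType (match) and FloatType (mismatch).
funfun_int = udf(very_fun, IntegerType())
funfun_float = udf(very_fun, FloatType())

# Function returning a float, declared as IntegerType (mismatch) and FloatType (match).
floatingfun_int = udf(floating_fun, IntegerType())
floatingfun_float = udf(floating_fun, FloatType())

df = df.withColumn('funfun_int', funfun_int(df['id']))
df = df.withColumn('funfun_float', funfun_float(df['id']))

df = df.withColumn('floatingfun_int', floatingfun_int(df['id']))
df = df.withColumn('floatingfun_float', floatingfun_float(df['id']))

df.show()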

And the result is not very amusing:

+---+---+----------+------------+---------------+-----------------+
| id|txt|funfun_int|funfun_float|floatingfun_int|floatingfun_float|
+---+---+----------+------------+---------------+-----------------+
|  1|foo|        22|        null|           null|             22.0|
|  2|bar|        22|        null|           null|             22.0|
+---+---+----------+------------+---------------+-----------------+

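Conclusion: know your types. A PySpark UDF is not going to do a cast for you: when the Python value returned by the function does not match the declared return type, you get null instead of an error, as in the table above.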
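
One way around this is to coerce the Python return value yourself before Spark sees it. The following is only a minimal sketch: the helper names as_float, as_int and build_udfs are hypothetical (they are not part of PySpark), and very_fun and floating_fun are repeated from above so the snippet stands alone.

```python
import functools


def as_float(fn):
    """Wrap fn so that a non-None result is always a Python float."""
    @functools.wraps(fn)
    def wrapper(*args):
        result = fn(*args)
        return None if result is None else float(result)
    return wrapper


def as_int(fn):
    """Wrap fn so that a non-None result is always a Python int.

    Note that int() truncates toward zero, so 22.9 becomes 22.
    """
    @functools.wraps(fn)
    def wrapper(*args):
        result = fn(*args)
        return None if result is None else int(result)
    return wrapper


def very_fun(idk):
    return 22


def floating_fun(idk):
    return 22.0


def build_udfs():
    """Rebuild the two mismatched UDFs from the post with explicit coercion."""
    # Imported here so the helpers above stay usable without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, FloatType

    funfun_float = udf(as_float(very_fun), FloatType())
    floatingfun_int = udf(as_int(floating_fun), IntegerType())
    return funfun_float, floatingfun_int
```

With these wrappers, the Python value handed back to Spark already has the type the UDF declares, so the funfun_float and floatingfun_int columns should contain 22.0 and 22 rather than null. The coercion is explicit, which also makes the truncation in as_int a visible decision rather than a surprise.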

Note: I haven’t tested pandas UDFs in this case, but I suppose they are going to be a bit more creative.