Hello!
Here is a quick note about something that might not be obvious: be careful with UDF return types in PySpark.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, FloatType

def very_fun(idk):
    return 22      # returns a Python int

def floating_fun(idk):
    return 22.0    # returns a Python float

df = sqlContext.createDataFrame(
    [
        (1, 'foo'),
        (2, 'bar'),
    ],
    ['id', 'txt']
)

# Register each function twice, once per declared return type
funfun_int = udf(very_fun, IntegerType())
funfun_float = udf(very_fun, FloatType())
floatingfun_int = udf(floating_fun, IntegerType())
floatingfun_float = udf(floating_fun, FloatType())

df = df.withColumn('funfun_int', funfun_int(df['id']))
df = df.withColumn('funfun_float', funfun_float(df['id']))
df = df.withColumn('floatingfun_int', floatingfun_int(df['id']))
df = df.withColumn('floatingfun_float', floatingfun_float(df['id']))
df.show()
And the result is not very amusing:
+---+---+----------+------------+---------------+-----------------+
| id|txt|funfun_int|funfun_float|floatingfun_int|floatingfun_float|
+---+---+----------+------------+---------------+-----------------+
| 1|foo| 22| null| null| 22.0|
| 2|bar| 22| null| null| 22.0|
+---+---+----------+------------+---------------+-----------------+
Conclusion: know your types. A PySpark UDF will not cast for you: when the Python value your function returns does not match the declared return type, Spark silently produces null instead of raising an error.
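A simple way out is to convert explicitly inside the function so the returned Python value always matches the declared Spark type. A minimal sketch (the `_fixed` names are mine, not from the original code above):

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def floating_fun_fixed(idk):
    # Explicit conversion: always return a Python float,
    # which matches the declared FloatType below.
    return float(22)

funfun_float_fixed = udf(floating_fun_fixed, FloatType())

# Alternative: keep the int-returning UDF and cast the resulting
# column instead, e.g. funfun_int(df['id']).cast('float')

Either way, the point is the same: the conversion has to be done by you, on the Python side or via an explicit column cast.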
Note: I haven’t tested pandas UDFs (pandas_udf) in this case, but I suppose they’re going to be a bit more creative.