PySpark and UDF types problem - 2020-11-02 12:20:05

Hello! Here is a quick note about something that might not be obvious: beware of UDF return types in PySpark.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, FloatType

def very_fun(idk):
    return 22       # returns a Python int

def floating_fun(idk):
    return 22.0     # returns a Python float

df = sqlContext.createDataFrame(
    [
        (1, 'foo'),
        (2, 'bar'),
    ],
    ['id', 'txt']
)

# Wrap each function twice, once per declared return type
funfun_int = udf(very_fun, IntegerType())
funfun_float = udf(very_fun, FloatType())
floatingfun_int = udf(floating_fun, IntegerType())
floatingfun_float = udf(floating_fun, FloatType())

df = df.withColumn('funfun_int', funfun_int(df['id']))
df = df.withColumn('funfun_float', funfun_float(df['id']))
df = df.withColumn('floatingfun_int', floatingfun_int(df['id']))
df = df.withColumn('floatingfun_float', floatingfun_float(df['id']))
```
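The gotcha, hedged since the details can vary across Spark versions: PySpark does not coerce a UDF's return value to the declared type. A Python int returned under FloatType (or a float under IntegerType) silently becomes null in the column, with no error raised. A minimal pure-Python sketch of that strict-match behavior, using a hypothetical `apply_udf` helper that stands in for Spark's conversion step:

```python
def apply_udf(func, return_type, value):
    """Hypothetical stand-in for Spark's UDF result conversion.

    Sketches the strict behavior: if the Python type of the result
    does not match the declared return type, the value becomes None
    (shown as null in the DataFrame) instead of being coerced.
    """
    expected = {"integer": int, "float": float}[return_type]
    result = func(value)
    if type(result) is not expected:
        return None  # type mismatch: silently nulled, no error raised
    return result

def very_fun(idk):
    return 22        # Python int

def floating_fun(idk):
    return 22.0      # Python float

# Matching declarations pass through; mismatched ones yield None.
print(apply_udf(very_fun, "integer", 1))      # 22
print(apply_udf(very_fun, "float", 1))        # None
print(apply_udf(floating_fun, "integer", 1))  # None
print(apply_udf(floating_fun, "float", 1))    # 22.0
```

In other words, declaring the wrong return type does not fail loudly; you just get a column full of nulls, which is exactly why this is worth double-checking.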

SparkR gapply mess - 2017-05-12 08:56:31

Hello! Do not assume anything. Never. Ever. Especially with SparkR (Apache Spark 2.1.0). When using the gapply function, you may want to return the key in order to label the results, with a function like this:

```r
countRows <- function(key, values) {
    df <- data.frame(key=key, nvalues=nrow(values))
    return(df)
}

count <- gapplyCollect(data, "keyAttribute", countRows)
```

SURPRISE. You can't. You should get this error: