SparkR gapply mess

Published 2017-05-12 08:56:31

Hello,

Do not assume anything. Never. Ever. Especially with SparkR (Apache Spark 2.1.0).

When using the gapply function, you may want to return the key along with the results, so you can tell which group each row belongs to, using a function like this:

countRows <- function(key, values) {
    df <- data.frame(key=key, nvalues=nrow(values))
    return(df)
}   

count <- gapplyCollect(data, "keyAttribute", countRows)

SURPRISE. You can’t.

You should get this error:

Error in match.names(clabs, names(xi)): names do not match previous names

Well, that’s weird. Why is this happening?

Actually, key is a list, because you can group by more than one column. When a list is passed to data.frame, the column name is taken from the list's contents instead of the name you specify, so two different keys produce two different column names and the per-group results cannot be combined. An easy way to fix this is to unlist the key:

countRows <- function(key, values) {
    df <- data.frame(key=unlist(key), nvalues=nrow(values))
    return(df)
}   

count <- gapplyCollect(data, "keyAttribute", countRows)
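
The naming problem can be reproduced in plain R, without a Spark session. The sketch below simulates what each group's call builds, assuming that key arrives as an unnamed one-element list; the key values "a" and "b" and the row counts are made up for illustration.

```r
# Simulate the per-group results that gapply would have to combine.
# Assumption: key is an unnamed list holding a single grouping value.
bad_a <- data.frame(key = list("a"), nvalues = 3L)
bad_b <- data.frame(key = list("b"), nvalues = 5L)

# The first column is named after the list contents, not after `key =`,
# so the two frames end up with different column names.
print(names(bad_a))
print(names(bad_b))

# Combining frames whose names differ fails; capture the message.
res <- tryCatch(rbind(bad_a, bad_b),
                error = function(e) conditionMessage(e))
print(res)

# With unlist() the key is a plain vector, so the name `key` is kept
# and the frames can be combined.
good_a <- data.frame(key = unlist(list("a")), nvalues = 3L)
good_b <- data.frame(key = unlist(list("b")), nvalues = 5L)
combined <- rbind(good_a, good_b)
print(combined)
```

Printing the names of each intermediate result like this is a quick way to check what a gapply function really returns before running it on the cluster.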

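MIND THAT THIS DOES NOT WORK WHEN YOU GROUP BY MORE THAN ONE COLUMN! In that case unlist(key) returns one value per grouping column, so the data frame gets one row per column instead of one row per group.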
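
For several grouping columns, one possible workaround is to give the key elements explicit names before building the data frame. This is only a sketch, simulated locally in plain R and not run against a Spark cluster: it assumes key arrives as an unnamed list with one element per grouping column, and the helper keyToFrame, the column names keyA and keyB, and the sample values are all hypothetical.

```r
# Hypothetical helper: turn the key list into a one-row data frame
# with explicit, stable column names.
keyToFrame <- function(key, keyNames) {
    names(key) <- keyNames
    as.data.frame(key, stringsAsFactors = FALSE)
}

countRows <- function(key, values) {
    # One row per group: the named key columns plus the row count.
    cbind(keyToFrame(key, c("keyA", "keyB")), nvalues = nrow(values))
}

# Local simulation of two groups (assumed shape of key and values).
g1 <- countRows(list("a", 1L), data.frame(x = 1:3))
g2 <- countRows(list("b", 2L), data.frame(x = 1:5))
result <- rbind(g1, g2)
print(result)

# For comparison, unlist() on a two-column key yields two rows per group.
wrong <- data.frame(key = unlist(list("a", 1L)), nvalues = 3L)
print(nrow(wrong))
```

Because every group returns the same column names, the results combine cleanly, and each key column keeps its own type instead of being coerced into a single vector.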