Sparklyr's spark_apply function seems to run on single executor and fails on moderately-large dataset

I am trying to use spark_apply to run the R function below on a Spark table. This works fine when the input table is small (e.g. 5,000 rows), but when the table is moderately large (e.g. 5,000,000 rows) it fails after ~30 minutes with the error: "sparklyr worker rscript failure, check worker logs for details"

The Spark UI shows that only a single task is ever created, and that a single executor is assigned to it.

Can anyone advise why this function fails on a 5-million-row dataset? Could the problem be that a single executor is being made to do all the work, and failing?

# Load packages; assumes `sc` is an existing connection created with spark_connect()
library(sparklyr)
library(dplyr)

# Create data and copy to Spark
testdf <- data.frame(string_id=rep(letters[1:5], times=1000), # 5000 row table
                 string_categories=rep(c("", "1", "2 3", "4 5 6", "7"), times=1000))
testtbl <- sdf_copy_to(sc, testdf, overwrite=TRUE, repartition=100L, memory=TRUE)

# Write function to return dataframe with strings split out
myFunction <- function(inputdf){
  inputdf$string_categories <- as.character(inputdf$string_categories)
  inputdf$string_categories=with(inputdf, ifelse(string_categories=="", "blank", string_categories))
  stringCategoriesList <- strsplit(inputdf$string_categories, ' ')
  outDF <- data.frame(string_id=rep(inputdf$string_id, times=unlist(lapply(stringCategoriesList, length))),
                  string_categories=unlist(stringCategoriesList))
 return(outDF)
}

# Use spark_apply to run function in Spark
outtbl <- testtbl %>%
  spark_apply(myFunction,
          names=c('string_id', 'string_categories'))
outtbl

The "sparklyr worker rscript failure, check worker logs for details" error is written by the driver node and indicates that the actual error has to be looked up in the worker logs. Usually, the worker logs can be accessed by opening stdout from the Executors tab in the Spark UI; the logs should contain RScript: entries describing what the executor is processing and the specifics of the error.

Regarding the single task being created: when column types are not specified in spark_apply(), sparklyr needs to compute a subset of the result to guess the column types. To avoid this, pass explicit column types as follows:

outtbl <- testtbl %>%
  spark_apply(myFunction,
              columns=list(string_id="character",
                           string_categories="character"))
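
One way to decide what to put in the columns argument is to run the function locally on a few rows and look at the classes of the result. The following is only a sketch in base R: split_categories is a hypothetical stand-in with the same splitting logic as myFunction, and it assumes a small local sample is representative of the Spark table.

```r
# Hypothetical local stand-in for myFunction (same splitting logic).
split_categories <- function(inputdf) {
  cats <- as.character(inputdf$string_categories)
  cats <- ifelse(cats == "", "blank", cats)
  parts <- strsplit(cats, " ")
  data.frame(
    string_id = rep(as.character(inputdf$string_id), times = lengths(parts)),
    string_categories = unlist(parts),
    stringsAsFactors = FALSE
  )
}

# Small local sample with the same columns as the Spark table.
sample_df <- data.frame(
  string_id = letters[1:5],
  string_categories = c("", "1", "2 3", "4 5 6", "7"),
  stringsAsFactors = FALSE
)

# Column names and classes of the result, as a named list.
local_result <- split_categories(sample_df)
column_types <- lapply(local_result, class)
str(column_types)
```

The named list has the same shape as the columns argument used above, so it can serve as a quick check before running the job on the cluster.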

If you are using sparklyr 0.6.3, update to sparklyr 0.6.4 or install the development version with devtools::install_github("rstudio/sparklyr"), since sparklyr 0.6.3 contains an incorrect wait time in some edge cases where package distribution is enabled and more than one executor runs on each node.

Under high load, it is common to run out of memory. Increasing the number of partitions could resolve this issue, since each partition would then hold fewer rows and need less memory to process. Try running this as:

testtbl %>%
  sdf_repartition(1000) %>%
  spark_apply(myFunction, names=c('string_id', 'string_categories'))
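
To get a feel for what repartitioning changes, here is the rough arithmetic for the table size in the question. This is a sketch that assumes rows are spread evenly across partitions, which a real table may not satisfy.

```r
# Rough rows-per-partition arithmetic, assuming an even spread of rows.
total_rows <- 5e6              # table size from the question
partitions <- c(100, 1000)     # original repartition value vs. the suggested one

rows_per_partition <- total_rows / partitions
names(rows_per_partition) <- partitions
rows_per_partition
```

Each R worker only has to hold one partition's data frame at a time, so going from 100 to 1000 partitions cuts the rows per call by a factor of ten.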

It could also be that the function throws an exception for some of the partitions because of its own logic. You can check this by using tryCatch() to ignore the errors, then finding which values are missing from the output and why the function fails for those values. I would start with something like:

myFunction <- function(inputdf){
  tryCatch({
    inputdf$string_categories <- as.character(inputdf$string_categories)
    inputdf$string_categories=with(inputdf, ifelse(string_categories=="", "blank", string_categories))
    stringCategoriesList <- strsplit(inputdf$string_categories, ' ')
    outDF <- data.frame(string_id=rep(inputdf$string_id, times=unlist(lapply(stringCategoriesList, length))),
                    string_categories=unlist(stringCategoriesList))
    return(outDF)
  }, error = function(e) {
    # Return a marker row instead of failing the whole partition
    return(
      data.frame(string_id = c(0), string_categories = c("error"))
    )
  })
}
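
The same idea can be tried locally before going back to the cluster. The sketch below uses base R only: fragile is a hypothetical function that fails on NA input (a stand-in for whatever the real bug is), and split() is used to imitate partitions. It is not how sparklyr distributes the data, only a way to rehearse the tryCatch() approach.

```r
# Hypothetical function that fails on some input, standing in for a real bug.
fragile <- function(df) {
  if (any(is.na(df$string_categories))) stop("NA category encountered")
  parts <- strsplit(df$string_categories, " ")
  data.frame(
    string_id = rep(df$string_id, times = lengths(parts)),
    string_categories = unlist(parts),
    stringsAsFactors = FALSE
  )
}

# Wrap it so a failing chunk yields a marker row carrying the error message.
safe <- function(df) {
  tryCatch(fragile(df), error = function(e) {
    data.frame(string_id = "error",
               string_categories = conditionMessage(e),
               stringsAsFactors = FALSE)
  })
}

local_df <- data.frame(
  string_id = c("a", "b", "c", "d"),
  string_categories = c("1", "2 3", NA, "4"),
  stringsAsFactors = FALSE
)

# Imitate two partitions locally and apply the wrapper to each.
chunks <- split(local_df, c(1, 1, 2, 2))
result <- do.call(rbind, lapply(chunks, safe))

# Ids present in the input but absent from the output point at the failing chunk.
missing_ids <- setdiff(local_df$string_id, result$string_id)
missing_ids
```

Returning conditionMessage(e) in the marker row, rather than a fixed "error" string, is a design choice made here so that the reason for the failure travels back with the result.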

Source: https://stackoverflow.com/questions/46396736/sparklyrs-spark-apply-function-seems-to-run-on-single-executor-and-fails-on-mod

Tags

apache-spark

sparklyr