Sparklyr's spark_apply function seems to run on a single executor and fails on a moderately large dataset

Submitted by 我是研究僧i on 2019-12-06 05:29:15
  1. The error "sparklyr worker rscript failure, check worker logs for details" is written by the driver node and indicates that the actual error has to be found in the worker logs. Usually, the worker logs can be accessed by opening stdout from the Executors tab in the Spark UI; the logs should contain RScript: entries describing what the executor is processing and the specifics of the error.

  2. Regarding the single task being created: when column types are not specified in spark_apply(), it needs to compute a subset of the result to guess them. To avoid this, pass explicit column types as follows:

    outtbl <- testtbl %>% spark_apply(myFunction, columns = list(string_id = "character", string_categories = "character"))

  3. If you are using sparklyr 0.6.3, update to sparklyr 0.6.4 or install the development version with devtools::install_github("rstudio/sparklyr"), since sparklyr 0.6.3 contains an incorrect wait time in some edge cases where package distribution is enabled and more than one executor runs on each node.

  4. Under high load, it is common to run out of memory. Increasing the number of partitions could resolve this, since smaller partitions reduce the memory each task needs while processing its share of the dataset. Try running this as:

    testtbl %>% sdf_repartition(1000) %>% spark_apply(myFunction, names=c('string_id', 'string_categories'))

  5. It could also be the case that the function throws an exception for some of the partitions because of its own logic. You can check this by using tryCatch() to swallow the errors, then finding which values are missing from the output and why the function would fail for them. I would start with something like:

    myFunction <- function(inputdf) {
      tryCatch({
        # Coerce to character in case the column arrives as a factor
        inputdf$string_categories <- as.character(inputdf$string_categories)
        # Replace empty strings with a placeholder so every row yields at least one category
        inputdf$string_categories <- with(inputdf, ifelse(string_categories == "", "blank", string_categories))
        stringCategoriesList <- strsplit(inputdf$string_categories, ' ')
        # One output row per (id, category) pair
        outDF <- data.frame(
          string_id = rep(inputdf$string_id, times = unlist(lapply(stringCategoriesList, length))),
          string_categories = unlist(stringCategoriesList)
        )
        return(outDF)
      }, error = function(e) {
        # Sentinel row marking a partition where the function failed
        return(data.frame(string_id = c(0), string_categories = c("error")))
      })
    }
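
Before sending the function to the cluster, it can help to exercise it on a small local data frame, with no Spark involved, to see what a healthy result and a failed partition look like. The following is only a sketch under assumptions: sample_df, bad_df, and their values are made up for illustration, and this copy of myFunction is lightly adapted (string_id is coerced to character and the sentinel id is "0") so that both code paths return the same column types.

```r
# Lightly adapted local copy of the partition function (an assumption, not the original verbatim):
# string_id is coerced to character so the normal and error paths return matching types.
myFunction <- function(inputdf) {
  tryCatch({
    inputdf$string_categories <- as.character(inputdf$string_categories)
    inputdf$string_categories <- ifelse(inputdf$string_categories == "",
                                        "blank", inputdf$string_categories)
    parts <- strsplit(inputdf$string_categories, " ")
    data.frame(
      string_id = rep(as.character(inputdf$string_id), times = lengths(parts)),
      string_categories = unlist(parts),
      stringsAsFactors = FALSE
    )
  }, error = function(e) {
    # Sentinel row: marks an input for which the function failed
    data.frame(string_id = "0", string_categories = "error", stringsAsFactors = FALSE)
  })
}

# Hypothetical sample that mimics one partition of the real table
sample_df <- data.frame(
  string_id = c("a1", "a2", "a3"),
  string_categories = c("x y", "", "z"),
  stringsAsFactors = FALSE
)
result <- myFunction(sample_df)
print(result)

# Hypothetical malformed partition: the string_categories column is missing,
# so the assignment inside the function errors and the sentinel row comes back
bad_df <- data.frame(string_id = c("b1", "b2"), stringsAsFactors = FALSE)
bad_result <- myFunction(bad_df)
print(bad_result)

# Rows carrying the sentinel value identify the failures
failed <- rbind(result, bad_result)
failed <- failed[failed$string_categories == "error", ]
print(nrow(failed))
```

On the cluster, the same idea applies to the output of spark_apply(): rows whose string_categories equals "error" point to the partitions worth inspecting in the worker logs.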
