Question
In the Java/Scala/Python implementations of Spark, one can simply call the foreach method of the RDD or DataFrame types in order to parallelize the iterations over a dataset. In SparkR I can't find such an instruction. What would be the proper way to iterate over the rows of a DataFrame?
I could only find the gapply and dapply functions, but I don't want to calculate new column values; I just want to do something with one element of a list at a time, in parallel.
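For reference, gapply groups a SparkDataFrame by one or more columns and applies an R function to each group in parallel on the executors. A minimal sketch of that pattern (using the inputDF and ID_M column from the attempt below; the per-group computation is made up, and the schema assumes ID_M is a string, so adjust the types to your data):
resultDF <- gapply(
  inputDF,
  "ID_M",
  function(key, x) {
    # key: the ID_M value of this group; x: a local R data.frame with the group's rows
    data.frame(key, nrow(x))  # hypothetical per-group result
  },
  structType(structField("ID_M", "string"), structField("n", "integer"))
)
head(collect(resultDF))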
My previous attempt was with spark.lapply:
# read the CSV into a SparkDataFrame and register it as a temp view
inputDF <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "")
createOrReplaceTempView(inputDF, 'inputData')

# collect the distinct IDs to the driver
distinctM <- sql('SELECT DISTINCT(ID_M) FROM inputData')
collected <- collect(distinctM)[[1]]

# for each ID, filter the original SparkDataFrame
problemSolver <- function(idM) {
  filteredDF <- filter(inputDF, inputDF$ID_M == idM)
}

spark.lapply(c(collected), problemSolver)
but I'm getting this error:
Error in handleErrors(returnStatus, conn) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 1 times, most recent failure: Lost task 1.0 in stage 5.0 (TID 207, localhost, executor driver): org.apache.spark.SparkException: R computation failed with
Error in callJMethod(x@sdf, "col", c) :
Invalid jobj 3. If SparkR was restarted, Spark operations need to be re-executed.
Calls: compute ... filter -> $ -> $ -> getColumn -> column -> callJMethod
What would be the proper way to solve this kind of problem in R/SparkR?
Answer 1:
I had a similar problem as well. Collecting a DataFrame pulls it into R as a regular data.frame. From there, you can work with each row as you normally would in plain old R. In my opinion, this is a poor pattern for processing data, because you lose the parallel processing Spark provides. Instead of collecting the data and then filtering, use the built-in SparkR functions: select, filter, etc. If you wish to do row-wise operations, the built-in SparkR functions will generally do this for you; otherwise, I have found selectExpr or expr to be very useful when the original Spark functions are designed to work on a single value (think: from_unix_timestamp).
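For instance, a small selectExpr sketch (using the inputSparkDF created in the steps below; the ts column and the from_unixtime call are purely illustrative, not from the question's data):
# assumes ts is an epoch-seconds column in inputSparkDF (hypothetical)
withTime <- SparkR::selectExpr(inputSparkDF, "ID_M", "from_unixtime(ts) AS event_time")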
So, to get what you want, I would try something like this (I'm on SparkR 2.0+):
First, read in the data as you have done:
inputDF <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "")
Then make this a SparkDataFrame (note: read.df already returns a SparkDataFrame, so this conversion is really only needed when starting from a local R data.frame):
inputSparkDF <- SparkR::createDataFrame(inputDF)
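For the case where the conversion does apply, a sketch (assuming the same csvPath but a base-R read):
localDF <- read.csv(csvPath, stringsAsFactors = FALSE)  # plain R data.frame on the driver
inputSparkDF <- SparkR::createDataFrame(localDF)        # distribute it as a SparkDataFrame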
Next, isolate only the distinct/unique values (I'm using magrittr for piping, which works with SparkR):
distinctSparkDF <- SparkR::select(inputSparkDF, "ID_M") %>% SparkR::distinct()
From here, you can apply filtering while still living in Spark's world:
filteredSparkDF <- SparkR::filter(distinctSparkDF, distinctSparkDF$ID_M == "value")
After Spark has filtered that data for you, it makes sense to collect the subset into base R as the last step in the workflow:
myRegularRDataframe <- SparkR::collect(filteredSparkDF)
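Once collected, myRegularRDataframe is an ordinary data.frame, so any remaining row-wise work can be done with plain R (sequentially, on the driver); a minimal sketch:
for (i in seq_len(nrow(myRegularRDataframe))) {
  row <- myRegularRDataframe[i, ]
  # ... do something with this row in plain R ...
}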
I hope this helps. Best of luck. --nate
Source: https://stackoverflow.com/questions/41816328/sparkr-foreach-loop