Question
How can I efficiently convert a large local data frame to a SparkR DataFrame? On my local dev machine, which has 40 GB of RAM, an ~650 MB local data frame quickly exceeds available memory when I try to convert it to a SparkR DataFrame.
library(reshape2)

# Build a wide data frame of simulated wind speeds (100 rows x 316,387 columns),
# then melt it to long format; the result is roughly 650 MB in memory
years <- sample(1:10, 100, replace = TRUE)
storms <- sample(1:10, 100, replace = TRUE)
wind_speeds <- matrix(ncol = 316387, nrow = 100,
                      data = sample(0:250, 31638700, replace = TRUE))
df <- data.frame(year = years, storm = storms, ws = wind_speeds)
df <- melt(df, id.vars = c('year', 'storm'))

# Start a local SparkR session with 10 GB of driver memory
Sys.setenv(SPARK_HOME = "/home/your/path/spark-2.0.0-bin-hadoop2.7")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "10g"))

spark_df <- as.DataFrame(df) # This quickly exceeds available memory
Answer 1:
I'm still very interested in an answer to this, but I wanted to post my workaround.
My end goal was to convert 5,000 large binary files to the Parquet format so that the data could be queried. I had intended to iterate over them serially, converting each to a SparkR DataFrame and writing it out with Spark's write.parquet function, and then ran into the problem that generated this question: for whatever reason, Spark could not convert a 650 MB local data frame to a SparkR distributed DataFrame without running out of memory (40 GB on my dev box).
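For context, the intended serial approach looked roughly like the sketch below. This is a reconstruction rather than the poster's actual code: the paths and the readBin-based parse step are placeholders for the real binary reader, and a SparkR session is assumed to be running as in the question. The as.DataFrame call is where the memory blow-up occurred.

# Reconstruction of the intended serial approach (placeholder paths and parsing)
binary_files <- list.files("/path/to/binary/files", full.names = TRUE)

for (f in binary_files) {
  # Placeholder parse step -- replace with the real binary-format reader
  local_df <- data.frame(value = as.integer(readBin(f, what = "raw",
                                                    n = file.info(f)$size)))
  sdf <- as.DataFrame(local_df)   # ran out of memory for ~650 MB data frames
  write.parquet(sdf, file.path("/path/to/parquet", paste0(basename(f), ".parquet")))
}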
What I did for my workaround:
1. Use SparkR to convert the 5,000 binary files to CSV in parallel, using spark.lapply to call my conversion function on each file (see the sketch below).
2. Use Apache Drill to convert the CSV files to the Parquet format.
This was ~3.5 TB of data uncompressed as CSV files, and it ended up at ~350 GB in the Parquet format.
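A minimal sketch of step 1, under the same assumptions as above (hypothetical paths and a placeholder readBin-based parse step standing in for the actual conversion function): spark.lapply distributes one invocation of the function per file across the Spark workers and returns the results to the driver.

# Parallel CSV conversion via spark.lapply -- assumes sparkR.session() is already running
binary_files <- list.files("/path/to/binary/files", full.names = TRUE)

convert_to_csv <- function(path) {
  # Placeholder parse step -- replace with the real binary-format reader
  local_df <- data.frame(value = as.integer(readBin(path, what = "raw",
                                                    n = file.info(path)$size)))
  out <- file.path("/path/to/csv", paste0(basename(path), ".csv"))
  write.csv(local_df, out, row.names = FALSE)
  out  # return the output CSV path to the driver
}

csv_paths <- spark.lapply(binary_files, convert_to_csv)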
Source: https://stackoverflow.com/questions/39392327/how-best-to-handle-converting-a-large-local-data-frame-to-a-sparkr-data-frame