How best to handle converting a large local data frame to a SparkR data frame?

Submitted by 孤人 on 2019-12-21 20:35:50

Question


How can I efficiently convert a large local data frame to a SparkR data frame? On my dev machine, which has 40 GB of RAM, a ~650 MB local data frame quickly exceeds available memory when I try to convert it to a SparkR data frame.

library(reshape2)

# Build an example local data frame (~650 MB once melted)
years <- sample(1:10, 100, replace = TRUE)
storms <- sample(1:10, 100, replace = TRUE)
wind_speeds <- matrix(ncol = 316387, nrow = 100,
                      data = sample(0:250, 31638700, replace = TRUE))

df <- data.frame(year = years, storm = storms, ws = wind_speeds)
df <- melt(df, id.vars = c('year', 'storm'))

# Start a local SparkR session with 10 GB of driver memory
Sys.setenv(SPARK_HOME = "/home/your/path/spark-2.0.0-bin-hadoop2.7")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "10g"))

spark_df <- as.DataFrame(df)  # This quickly exceeds available memory

Answer 1:


I'm still very interested in an answer to this, but I wanted to post my workaround.

My end goal was to convert 5,000 large binary files into the Parquet format so that the data could be queried. I had intended to iterate over them serially and use Spark's write.parquet function, but I ran into the problem that prompted this question: for whatever reason, Spark could not convert a 650 MB local data frame to a SparkR distributed data frame without running out of memory (40 GB on my dev box).
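For reference, the serial approach I had intended looked roughly like this; read_binary_storm_file() and the paths are placeholders for my own conversion code, not SparkR functions:

# Rough sketch of the intended serial approach.
# read_binary_storm_file() and the paths are placeholders, not SparkR functions.
binary_files <- list.files("/path/to/binary/files", full.names = TRUE)

for (f in binary_files) {
  local_df <- read_binary_storm_file(f)  # read one binary file into a local data.frame
  spark_df <- as.DataFrame(local_df)     # this conversion is where it runs out of memory
  write.parquet(spark_df, file.path("/path/to/parquet", basename(f)))
}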

What I did for my workaround:

  • Used SparkR to convert the 5,000 binary files to CSV in parallel, using spark.lapply to call my conversion function (see the sketch after this list)

  • Used Apache Drill to convert the CSV files to the Parquet format

  • The data was ~3.5 TB uncompressed as CSV files and ended up at ~350 GB in the Parquet format
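Below is a rough sketch of the first step. convert_one_file(), read_binary_storm_file() and the paths are placeholders for my own binary-to-CSV conversion logic; the point is that spark.lapply distributes the calls across the executors, so no single large local data frame ever has to go through as.DataFrame.

# Rough sketch of step 1; convert_one_file(), read_binary_storm_file() and the
# paths are placeholders for my own conversion logic.
binary_files <- list.files("/path/to/binary/files", full.names = TRUE)

convert_one_file <- function(f) {
  local_df <- read_binary_storm_file(f)  # placeholder reader for one binary file
  out <- file.path("/path/to/csv", paste0(basename(f), ".csv"))
  write.csv(local_df, out, row.names = FALSE)
  out                                    # return only the small output path
}

# spark.lapply ships convert_one_file to the executors and runs it on the
# elements of binary_files in parallel, returning a list of results.
csv_paths <- spark.lapply(binary_files, convert_one_file)

Returning only the output path keeps the result that comes back to the driver small.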



来源:https://stackoverflow.com/questions/39392327/how-best-to-handle-converting-a-large-local-data-frame-to-a-sparkr-data-frame
