SparkR collect method crashes with OutOfMemory on Java heap space
With SparkR, as a PoC, I'm trying to collect an RDD that I created from text files containing around 4M lines. My Spark cluster runs on Google Cloud, is deployed with bdutil, and consists of 1 master and 2 workers, each with 15 GB of RAM and 4 cores. My HDFS repository is backed by Google Storage via gcs-connector 1.4.0. SparkR is installed on each machine, and basic tests work on small files. Here is the script I use:

    Sys.setenv("SPARK_MEM" = "1g")
    sc <- sparkR.init("spark://xxxx:7077", sparkEnvir=list(spark.executor.memory="1g"))
    lines <- textFile(sc, "gs://xxxx/dir/")
    test <-
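For context (my own assumption, not something verified on this cluster): `collect()` materializes the entire RDD inside the driver JVM, so the driver heap, not just `spark.executor.memory`, has to hold the ~4M lines. Driver memory generally must be set before the driver JVM starts, so a sketch of raising it with the SparkR 1.x API used above would look like:

    # Sketch, assuming the SparkR 1.x shell API. collect() pulls the whole RDD
    # into the driver JVM, so the driver heap must be sized for the full data.
    # SPARK_DRIVER_MEMORY is read at JVM launch (spark-env.sh / spark-class):
    Sys.setenv("SPARK_DRIVER_MEMORY" = "4g")   # must happen before sparkR.init
    sc <- sparkR.init("spark://xxxx:7077",
                      sparkEnvir = list(spark.executor.memory = "1g"))

Equivalently, starting the shell with `sparkR --driver-memory 4g` should have the same effect; setting `spark.driver.memory` in `sparkEnvir` may arrive too late, since the driver JVM is already running by then.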