Spark - Scope, Data Frame, and memory management


Question


I am curious about how scope works with DataFrames in Spark. In the example below, I have a list of files; each is independently loaded into a DataFrame, some operation is performed, and then dfOutput is written to disk.

val files = getListOfFiles("outputs/emailsSplit")

for (file <- files) {

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", "\t")         // delimiter is tab
    .option("parserLib", "UNIVOCITY")  // parser that deals better with the email formatting
    .schema(customSchema)              // schema of the table
    .load(file.toString)               // input file

  val dfOutput = df.[stuff happens]

  dfOutput.write
    .format("com.databricks.spark.csv")
    .mode("overwrite")
    .option("header", "true")
    .save("outputs/sentSplit/sentiment" + file.toString + ".csv")

}
  1. Is each DataFrame inside the for loop discarded when its iteration finishes, or do they all stay in memory?
  2. If they are not discarded, what is a better way to manage memory at this point?

Answer 1:


DataFrame objects themselves are tiny. However, they can reference cached data on Spark executors, and they can reference shuffle files on the executors. When a DataFrame is garbage collected, the corresponding cache and shuffle files on the executors are deleted as well.

In your code there are no references to the DataFrames past the loop, so they are eligible for garbage collection. Garbage collection typically happens in response to memory pressure. If you are worried about shuffle files filling up the disk, it may make sense to trigger an explicit GC to make sure shuffle files are deleted for DataFrames that are no longer referenced.
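For illustration, a minimal sketch of that approach, assuming the loop from the question (System.gc() is only a hint to the JVM; Spark's ContextCleaner then removes executor-side files for DataFrames that are no longer reachable):

for (file <- files) {
  // ... read, transform, and write as in the question ...
  System.gc()  // hint the driver JVM to collect; Spark's ContextCleaner then
               // cleans up shuffle files for DataFrames that are unreachable
}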

Depending on what you do with the DataFrame ([stuff happens]), it may be that no data is ever stored in memory. This is the default mode of operation in Spark: if you just want to read some data, transform it, and write it back out, it will all happen line by line, never storing any of it in memory. (Caching only happens when you explicitly ask for it.)
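As a sketch of what opt-in caching looks like (the input path and the reuse pattern here are hypothetical; cache(), count(), and unpersist() are standard DataFrame methods):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(customSchema)
  .load("some/input.csv")   // hypothetical path

df.cache()      // ask Spark to keep this DataFrame's data in executor memory
df.count()      // an action materializes the cache
// ... reuse df in several computations without re-reading the file ...
df.unpersist()  // explicitly release the cached blocks when done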

With all that, I suggest not worrying about memory management until you have problems.



Source: https://stackoverflow.com/questions/38023349/spark-scope-data-frame-and-memory-management
