What is the performance difference between accumulator and collect() in Spark?


Question


Accumulators are shared variables in Spark that are updated by the executors but read only by the driver. collect() in Spark brings all of the data from the executors into the driver.

So in both cases the data ultimately ends up on the driver. What, then, is the performance difference between using an accumulator and using collect() to convert a large RDD into a list?

Code to convert a DataFrame to a List using an accumulator:

import scala.collection.JavaConverters._

val queryOutput = spark.sql(query)
val acc = spark.sparkContext.collectionAccumulator[Map[String, Any]]("JsonCollector")
// foreach returns Unit; the results land in the accumulator, not in a return value
queryOutput.foreach(row => acc.add(convertRowToJSON(row)))
val jsonList = acc.value.asScala.toList


import org.apache.spark.sql.Row
import scala.util.parsing.json.JSONObject

def convertRowToJSON(row: Row): Map[String, Any] = {
  val m = row.getValuesMap[Any](row.schema.fieldNames)
  println(m)          // debug output; remove in production
  JSONObject(m).obj   // .obj unwraps the JSONObject back to the underlying Map
}

Code to convert a DataFrame to a List using collect():

val queryOutput = spark.sql(query)
queryOutput.toJSON.collectAsList()

Answer 1:


Regarding "convert a large RDD into a list":

This is not a good idea. collect() will move data from all executors into driver memory, and if that memory is not enough it will throw an OutOfMemoryError (OOM). And if your data fits in the memory of a single machine, you probably don't need Spark in the first place.
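If individual records really do need to reach the driver, a less memory-hungry alternative (an assumption about the use case, not something shown in the original answer) is Dataset.toLocalIterator, which streams one partition at a time:

import scala.collection.JavaConverters._

// Pulls one partition at a time into the driver, so peak driver memory
// is bounded by the largest partition rather than the whole result set.
val jsonIterator = spark.sql(query).toJSON.toLocalIterator().asScala
jsonIterator.foreach(json => println(json))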

Spark natively supports accumulators of numeric types, and programmers can add support for new types. They can be used to implement counters (as in MapReduce) or sums. The OUT type parameter of an accumulator should be a type that can be read atomically (e.g., Int, Long) or thread-safely (e.g., synchronized collections), because it will be read from other threads.
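For instance, a built-in numeric accumulator used as a counter (a minimal sketch; someRdd is a hypothetical RDD):

val errorCount = spark.sparkContext.longAccumulator("errorCount")

someRdd.foreach { record =>
  if (record == null) errorCount.add(1) // updated on the executors
}

println(errorCount.value) // read only on the driver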

CollectionAccumulator.value returns a java.util.List (backed by an ArrayList), and it will likewise throw an OOM error if that list grows larger than driver memory.
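In other words, both approaches materialize the full result on the driver; the accumulator route just adds a Java-to-Scala conversion step. A minimal sketch of that conversion (the accumulator name and sample data are illustrative):

import scala.collection.JavaConverters._

val lines = spark.sparkContext.collectionAccumulator[String]("lines")
spark.sparkContext.parallelize(Seq("a", "b", "c")).foreach(lines.add)

val javaList: java.util.List[String] = lines.value    // ArrayList-backed, lives entirely on the driver
val scalaList: List[String] = javaList.asScala.toList // asScala is a wrapper; toList copies into an immutable Scala List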



Source: https://stackoverflow.com/questions/57799402/what-is-the-performance-difference-between-accumulator-and-collect-in-spark
