What is the performance difference between accumulator and collect() in Spark?

Question

Accumulators are shared variables in Spark that executors update but that only the driver can read. collect() in Spark brings all the data from the executors to the driver.

So in both cases the data ultimately ends up on the driver. What, then, is the difference in performance between using an accumulator and using collect() to convert a large RDD into a List?

Code to convert a DataFrame to a List using an accumulator

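import org.apache.spark.sql.Row
import scala.collection.JavaConverters._
import scala.util.parsing.json.JSONObject

val queryOutput = spark.sql(query)
val acc = spark.sparkContext.collectionAccumulator[Map[String, Any]]("JsonCollector")
// foreach returns Unit, so there is nothing useful to assign; the rows end up in acc
queryOutput.foreach(a => acc.add(convertRowToJSON(a)))
// acc.value is a java.util.List held on the driver; asScala converts it to a Scala collection
acc.value.asScala.toList


def convertRowToJSON(row: Row): Map[String, Any] = {
    val m = row.getValuesMap[Any](row.schema.fieldNames)
    println(m) // runs on the executors, so the output goes to the executor logs
    JSONObject(m).obj
  }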

Code to convert a DataFrame to a List using collect()

val queryOutput = spark.sql(query)
queryOutput.toJSON.collectAsList()

Answer 1:

Converting a large RDD to a List

It is not a good idea. collect() moves the data from all executors into driver memory. If the driver does not have enough memory, it fails with an OutOfMemoryError (OOM). If your data fits in the memory of a single machine, you probably don't need Spark.

Spark natively supports accumulators of numeric types, and programmers can add support for new types. They can be used to implement counters (as in MapReduce) or sums. The OUT type parameter of an accumulator (AccumulatorV2[IN, OUT]) should be a type that can be read atomically (e.g., Int, Long) or thread-safely (e.g., synchronized collections), because it will be read from other threads.
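As an illustration of the counter use case mentioned above, here is a minimal sketch using Spark's built-in longAccumulator. The object name, the countNegatives helper, and the local[2] master are assumptions made for the example and do not come from the question.

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorCounterSketch {
  // Hypothetical helper: counts negative values with a numeric accumulator.
  // Only a single Long travels back to the driver, not the data itself.
  def countNegatives(spark: SparkSession, values: Seq[Int]): Long = {
    val negatives = spark.sparkContext.longAccumulator("negatives")
    spark.sparkContext.parallelize(values).foreach { v =>
      if (v < 0) negatives.add(1L) // updated on the executors
    }
    negatives.value.longValue() // read on the driver after the action completes
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch can run without a cluster
    val spark = SparkSession.builder().master("local[2]").appName("acc-sketch").getOrCreate()
    try println(countNegatives(spark, Seq(1, -2, 3, -4)))
    finally spark.stop()
  }
}
```

This is the kind of small aggregate that accumulators were designed for. Pushing every row through a CollectionAccumulator, as in the question, still leaves the whole dataset on the driver.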

CollectionAccumulator.value returns a java.util.List (an ArrayList implementation) that lives on the driver, so it will also cause an OOM if the accumulated data is larger than the driver's memory.
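If the rows really must be processed on the driver, one possible way to avoid holding them all at once is Dataset.toLocalIterator(), which needs only as much driver memory as the largest partition. This is a swapped-in alternative and not something the question used. The forEachJsonRow helper and the spark.range stand-in for spark.sql(query) below are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.JavaConverters._

object LocalIteratorSketch {
  // Hypothetical helper: streams the JSON rows to the driver one partition
  // at a time and hands each row to `handle`, without building a full List.
  def forEachJsonRow(df: DataFrame)(handle: String => Unit): Unit =
    df.toJSON.toLocalIterator().asScala.foreach(handle)

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch can run without a cluster
    val spark = SparkSession.builder().master("local[2]").appName("local-iterator-sketch").getOrCreate()
    try {
      val df = spark.range(5).toDF("id") // stand-in for spark.sql(query)
      var rows = 0L
      forEachJsonRow(df)(_ => rows += 1)
      println(s"rows seen on driver: $rows")
    } finally spark.stop()
  }
}
```

The trade-off is speed: partitions are fetched one after another, so this is generally slower than a single collect(). It only helps when the full result does not fit on the driver.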

Source: https://stackoverflow.com/questions/57799402/what-is-the-performance-difference-between-accumulator-and-collect-in-spark

Tags

apache-spark