Does caching in Spark Streaming increase performance?


Question


So I'm performing multiple operations on the same RDD in a Kafka stream. Is caching that RDD going to improve performance?


Answer 1:


When running multiple operations on the same DStream, cache will substantially improve performance. This can be observed in the Spark UI.

Without cache, each iteration over the DStream takes the same time, so the total time to process the data in each batch interval grows linearly with the number of iterations over the data.

When cache is used, the first execution of the transformation pipeline materializes and caches the RDD, and every subsequent iteration over that RDD takes only a fraction of the time to execute.

(In the screenshots accompanying the original answer, the execution time of the same job was further reduced from 3s to 0.4s by reducing the number of partitions.)
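As a minimal sketch of that behavior (using a socket source as a stand-in for Kafka, and hypothetical stream names), caching the DStream once lets every subsequent action reuse the materialized data:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-cache-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Stand-in source; a Kafka-backed DStream behaves the same way.
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    words.cache()

    words.count().print()                        // first action: computes and caches each batch RDD
    words.filter(_.startsWith("ERROR")).print()  // second action: served from the cache

    ssc.start()
    ssc.awaitTermination()
  }
}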

Instead of using dstream.cache, I would recommend using dstream.foreachRDD or dstream.transform to gain direct access to the underlying RDD and apply the persist operation there. We use matching persist and unpersist calls around the iterative code to free memory as soon as possible:

dstream.foreachRDD { rdd =>
  rdd.cache()  // materialized by the first action below, reused by the rest
  // 'col' is a local collection of ids driving the iterations
  col.foreach { id => rdd.filter(elem => elem.id == id).map(...).saveAs... }
  rdd.unpersist(true)  // free the cached blocks as soon as this batch is done
}
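A note on the unpersist(true) call above: the boolean flag selects the blocking variant, so the call waits until the cached blocks have actually been removed from the executors before the batch job completes; unpersist(false) returns immediately and frees the blocks asynchronously.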

Otherwise, one has to wait for the interval configured in spark.cleaner.ttl for the memory to be cleaned up.

Note that the default value for spark.cleaner.ttl is infinite, which is not recommended for a production 24x7 Spark Streaming job.
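For reference, a hedged sketch of how that setting would be supplied (spark.cleaner.ttl takes a duration in seconds; it was a Spark 1.x setting and no longer exists in newer Spark versions, where the ContextCleaner handles this automatically):

import org.apache.spark.SparkConf

// Hypothetical configuration: clean up stale metadata and cached blocks
// older than one hour instead of keeping them forever.
val conf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.cleaner.ttl", "3600")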




Answer 2:


Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank.

https://spark.apache.org/docs/latest/quick-start.html#caching
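A short sketch in the spirit of that quick-start example (the file path and filter terms are hypothetical): a small "hot" dataset is cached once and then queried repeatedly:

import org.apache.spark.{SparkConf, SparkContext}

object HotDatasetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hot-dataset-sketch"))

    // Hypothetical input; filter down to the "hot" subset and pin it in memory.
    val hot = sc.textFile("data.txt").filter(_.contains("ERROR"))
    hot.cache()

    println(hot.count())                             // first action: reads the file and fills the cache
    println(hot.filter(_.contains("disk")).count())  // later queries hit the cache

    sc.stop()
  }
}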



Source: https://stackoverflow.com/questions/30253897/does-caching-in-spark-streaming-increase-performance
