What does “RDDs can be stored in memory” mean in Spark?

廉价感情. Submitted on 2019-12-12 15:06:16

Question


The introduction to Spark says:

RDDs can be stored in memory between queries without requiring replication.

As far as I know, you must cache an RDD manually by calling .cache() or .persist(). Suppose I do neither, as below:

   val file = sc.textFile("hdfs://data/kv1.txt")
   // Keep the result: transformations are lazy and return a new RDD
   val words = file.flatMap(line => line.split(" "))
   words.count()

I don't persist the RDD "file" in memory or on disk. In this case, can Spark still run faster than MapReduce?


Answer 1:


Spark will compute each stage of the computation, partition by partition. It holds some data temporarily in memory to do its work, and it may have to spill data to disk and transfer it across the network to execute some stages. But none of this is (necessarily) persistent. If you call count() again, it starts from scratch.

This is not a case where Spark would run faster than MapReduce; it would probably be slower for a simple operation like this. In fact, there is nothing about this that would benefit from loading into memory.

More complex examples, like with a non-trivial pipeline or repeated access to the RDD, would show a benefit from persisting in memory, or even on disk.
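The recomputation described in this answer can be observed with a small experiment. The following is only a sketch, assuming Spark 2.0 or later running in local mode; the object name `CacheSketch`, the helper `linesProcessed`, and the three-line in-memory input are made up for illustration and stand in for the HDFS file in the question. An accumulator counts how many input lines the `flatMap` function processes across two `count()` calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  // Runs count() twice over a small word RDD and returns how many input
  // lines the flatMap function processed in total (hypothetical helper).
  def linesProcessed(sc: SparkContext, useCache: Boolean): Long = {
    val processed = sc.longAccumulator("linesProcessed")
    // Tiny in-memory input, assumed here in place of an HDFS file.
    val lines = sc.parallelize(Seq("a b", "c d e", "f"), numSlices = 2)
    val words = lines.flatMap { line =>
      processed.add(1L) // side effect used only to observe recomputation
      line.split(" ")
    }
    if (useCache) words.cache() // only marks the RDD; nothing is stored yet
    words.count() // first action: computes the RDD (and stores it if cached)
    words.count() // second action: recomputes from scratch unless cached
    processed.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(s"without cache: ${linesProcessed(sc, useCache = false)}")
      println(s"with cache:    ${linesProcessed(sc, useCache = true)}")
    } finally {
      sc.stop()
    }
  }
}
```

Without `cache()`, each of the three lines passes through `flatMap` once per action, six times in total; with `cache()`, the second `count()` reads the stored partitions, so the total stays at three. On a real cluster, cached partitions can be evicted under memory pressure and recomputed, so the saving is not guaranteed.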




Answer 2:


Yes tonyking, it will run faster than MapReduce, no doubt. Spark processes RDDs in memory, and by default each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark keeps the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.

http://spark.apache.org/docs/latest/programming-guide.html

"This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank"

The answer to your question, "What does “RDDs can be stored in memory” mean in Spark?", is that we can store an RDD in RAM using .cache(), so it is not recomputed each time we apply an action to it.
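The persistence options mentioned in this answer (memory, disk, and replication across nodes) correspond to storage levels passed to `persist`. Below is a minimal sketch, assuming Spark in local mode; the object name `PersistSketch`, the helper `persistedVariants`, and the sample data are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Builds four derived RDDs, each marked with a different storage level.
  // A separate RDD is used per level because the level of an RDD cannot be
  // changed once it has been assigned.
  def persistedVariants(sc: SparkContext): Map[String, RDD[Int]] = {
    val numbers = sc.parallelize(1 to 1000)
    Map(
      // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
      "memory" -> numbers.map(_ * 2).cache(),
      // Partitions that do not fit in memory are written to disk.
      "memoryAndDisk" -> numbers.map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK),
      // Partitions are stored on disk only.
      "diskOnly" -> numbers.map(_ + 2).persist(StorageLevel.DISK_ONLY),
      // Each partition is kept on two nodes; this needs more than one
      // executor to take effect, so it is only marked here, not materialized.
      "replicated" -> numbers.map(_ - 1).persist(StorageLevel.MEMORY_ONLY_2)
    )
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val variants = persistedVariants(sc)
      variants.foreach { case (name, rdd) =>
        println(s"$name -> ${rdd.getStorageLevel}")
      }
      val inMemory = variants("memory")
      inMemory.count()     // first action materializes and stores the partitions
      inMemory.count()     // later actions can read the stored partitions
      inMemory.unpersist() // release the storage when it is no longer needed
    } finally {
      sc.stop()
    }
  }
}
```

Calling `persist` or `cache` only marks the RDD; the data is stored when the first action runs, and `unpersist` releases it.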



Source: https://stackoverflow.com/questions/25760206/what-does-rdds-can-be-stored-in-memory-mean-in-spark
