Apache Spark: what am I persisting here?


Question


In this line, which RDD is being persisted? dropResultsN or dataSetN?

dropResultsN = dataSetN.map(s -> standin.call(s)).persist(StorageLevel.MEMORY_ONLY());

This question arises as a side issue from "Apache Spark timing forEach operation on JavaRDD", where I am still looking for a good answer to the core question of how best to time RDD creation.


Answer 1:


dropResultsN is the persisted RDD (the RDD produced by mapping dataSetN through standin.call()).
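To see why, note that persist() returns the RDD it is called on, so the chained call in the question marks the mapped RDD for caching and hands that same RDD back as dropResultsN. Here is a minimal, self-contained sketch of the two equivalent forms (the question's standin.call(s) is replaced by a simple toUpperCase() purely for illustration, and the local[*] master is just for running it locally):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("persist-example").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Stand-in for dataSetN from the question.
        JavaRDD<String> dataSetN = sc.parallelize(Arrays.asList("a", "b", "c"));

        // Chained form from the question: map() produces a new RDD, persist() is
        // called on that mapped RDD and returns it, so dropResultsN refers to the
        // persisted, mapped RDD rather than to dataSetN.
        JavaRDD<String> dropResultsN =
                dataSetN.map(s -> s.toUpperCase()).persist(StorageLevel.MEMORY_ONLY());

        // Equivalent two-step form: assign first, then persist; it is now explicit
        // that the mapped RDD is the one being cached.
        JavaRDD<String> dropResultsN2 = dataSetN.map(s -> s.toUpperCase());
        dropResultsN2.persist(StorageLevel.MEMORY_ONLY());

        sc.stop();
    }
}

Either way, dataSetN itself is not cached; only the RDD produced by map() is.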




Answer 2:


I found a good example of this in Learning Spark by O'Reilly:

It's Example 3-40, "persist() in Scala" (Java should behave the same way):

import org.apache.spark.storage.StorageLevel

val result = input.map(x => x * x)
result.persist(StorageLevel.MEMORY_ONLY)  // or whichever storage level you choose

NOTE in Learning Spark: Notice that we called persist() on the RDD before the first action. The persist() call on its own doesn't force evaluation.

MY NOTE: in this example the persist() call is on its own line, which I think is much clearer than the chained call in my question.
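A Java rendering of the same pattern might look like the sketch below (input stands in for any existing JavaRDD<Integer>, and MEMORY_ONLY is chosen only as an example storage level). The book's note is visible here: persist() merely marks result for caching, and nothing is computed or cached until the first action, count(), runs.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

public class PersistBeforeAction {
    // persist() is called on the mapped RDD before the first action; the map and
    // the caching only happen once count() runs.
    static long squareAndCache(JavaRDD<Integer> input) {
        JavaRDD<Integer> result = input.map(x -> x * x);
        result.persist(StorageLevel.MEMORY_ONLY());  // marks result for caching, computes nothing yet
        return result.count();                       // first action: triggers the map and fills the cache
    }
}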



Source: https://stackoverflow.com/questions/38317733/apache-spark-what-am-i-persisting-here
