What happens if I cache the same RDD twice in Spark


Question


I'm building a generic function which receives an RDD and runs some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example:

public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD<String> t1 = r... // some calculations
    JavaRDD<String> t2 = r... // other calculations
    return t1.union(t2);
}

My question is: since r is handed to me, it may or may not already be cached. If it is cached and I call cache on it again, will Spark create a new layer of cache, meaning that while t1 and t2 are computed there will be two instances of r in the cache? Or is Spark aware that r is already cached, so it ignores the second call?


Answer 1:


Nothing. If you call cache on an already-cached RDD, nothing happens: the RDD is cached once. Caching, like many other operations in Spark, is lazy:

  • When you call cache, the RDD's storageLevel is set to MEMORY_ONLY.
  • When you call cache again, it is set to the same value, so nothing changes.
  • Upon evaluation, when the underlying RDD is materialized, Spark checks its storageLevel and, if it requires caching, caches it.

So you're safe.
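
You can watch this happen through the RDD's storage level. Here is a minimal PySpark sketch of the mechanism described above (the sample data and the app name are my own, not taken from the answer):

from pyspark import SparkContext

sc = SparkContext("local[2]", "cache-twice-demo")
r = sc.parallelize(["a", "b", "c"])

print(r.getStorageLevel())  # no storage level assigned yet
r.cache()                   # storageLevel is now MEMORY_ONLY
print(r.getStorageLevel())
r.cache()                   # same level again, so nothing changes
print(r.getStorageLevel())  # identical to the previous output

The only thing Spark rejects is asking for a different storage level via persist() on an already-persisted RDD; repeating cache() with the same level is simply a no-op.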




Answer 2:


I just tested this on my cluster; Zohar is right: nothing happens, the RDD is only cached once. The reason, I think, is that every RDD has an internal id, and Spark uses that id to mark whether an RDD has been cached. So caching one RDD multiple times does nothing.

Below is my code (and a screenshot of the web UI, not reproduced here):

Update: added the code, as requested.


### cache and count; the storage info will then show up on the web UI

raw_file = sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40)\
                 .setName("raw_file")\
                 .cache()
raw_file.count()

### try to cache and count again, then take a look at the web UI: nothing changes

raw_file.cache()
raw_file.count()

### try to change the RDD's name, then cache and count again, to see whether it caches a new RDD under the
### new name; still nothing changes, so I think it is using the RDD id as the mark. For more detail we would
### need a careful read of the documentation, or even the source code

raw_file.setName("raw_file_2")
raw_file.cache().count()
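
To poke at the id hypothesis a little further, here is a small follow-up sketch (my addition, not part of the original answer). The RDD's id stays the same across setName() and repeated cache() calls, which fits the observation above, since cached blocks are keyed by RDD id (they appear in the storage UI as rdd_<id>_<partition>):

### follow-up: the RDD id never changes, however often we rename or re-cache

print(raw_file.id())          # the id assigned when the RDD was created
raw_file.setName("raw_file_2").cache()
print(raw_file.id())          # same id, hence still the same cache entry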


Source: https://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark
