Question:
I am new to Apache Spark. I created several RDDs and DataFrames and cached them, and now I want to unpersist some of them using the command below:
rddName.unpersist()
but I can't remember their names. I used sc.getPersistentRDDs, but the output does not include the names. I also looked at the cached RDDs in the Spark UI through the browser, but again there is no name information. Am I missing something?
Answer 1:
@Dikei's answer is actually correct, but I believe what you are looking for is sc.getPersistentRDDs:
scala> val rdd1 = sc.makeRDD(1 to 100)
# rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:27
scala> val rdd2 = sc.makeRDD(10 to 1000)
# rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:27
scala> rdd2.cache.setName("rdd_2")
# res0: rdd2.type = rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res1: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27)
scala> rdd1.cache.setName("foo")
# res2: rdd1.type = foo ParallelCollectionRDD[0] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27, 0 -> foo ParallelCollectionRDD[0] at makeRDD at <console>:27)
Now let's add another RDD and name it as well:
scala> rdd3.setName("bar")
# res4: rdd3.type = bar ParallelCollectionRDD[2] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res5: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27, 0 -> foo ParallelCollectionRDD[0] at makeRDD at <console>:27)
Notice that rdd3 does not show up in the output: we only called setName on it, never cache, so it isn't actually persisted.
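Building on the session above, if the goal is to unpersist one specific RDD by the name given to it via setName, a minimal sketch could look like this (the name "rdd_2" is just the example name used above):
// look up cached RDDs by the name assigned via setName and unpersist them
sc.getPersistentRDDs
  .values
  .filter(_.name == "rdd_2")
  .foreach(_.unpersist())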
Answer 2:
PySparkers: getPersistentRDDs isn't yet implemented in Python, so unpersist your RDDs by dipping into Java:
for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()
Answer 3:
A generic Scala way of doing this: loop over the SparkContext, get all persistent RDDs, and unpersist each one. I use this at the end of a driver program.
for ((id, rdd) <- sparkSession.sparkContext.getPersistentRDDs) {
  log.info("Unexpected cached RDD " + id)
  rdd.unpersist()
}
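The question also mentions cached DataFrames; those can be dropped through the catalog without going RDD by RDD. A rough sketch, assuming Spark 2.x with a SparkSession named sparkSession (the table name "my_table" is hypothetical):
sparkSession.catalog.clearCache()              // drop every cached DataFrame/table
sparkSession.catalog.uncacheTable("my_table")  // or drop a single cached table by name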
A generic Java way of doing this, where jsc is a JavaSparkContext:
if (jsc != null) {
  Map<Integer, JavaRDD<?>> persistentRDDS = jsc.getPersistentRDDs();
  // iterate over Map.entrySet() with a for-each loop
  for (Map.Entry<Integer, JavaRDD<?>> entry : persistentRDDS.entrySet()) {
    LOG.info("Key = " + entry.getKey()
        + ", unpersisting cached RDD = " + entry.getValue().unpersist());
  }
}
Another short form of unpersisting in Java, without knowing the RDD names:
Map<Integer, JavaRDD<?>> persistentRDDS = jsc.getPersistentRDDs();
persistentRDDS.values().forEach(JavaRDD::unpersist);
Answer 4:
There's no special meaning to the rddName variable. It is just a reference to an RDD. For example, in the following code
val rddName: RDD[Something]
val name2 = rddName
name2 and rddName are two references that point to the same RDD. Calling name2.unpersist is the same as calling rddName.unpersist.
If you want to unpersist an RDD, you have to manually keep a reference to it.
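To make this concrete, here is a small illustrative sketch (the names rdd and alias are made up for the example) showing that either reference unpersists the same underlying RDD:
val rdd = sc.makeRDD(1 to 10).setName("shared").cache()
val alias = rdd                  // a second reference to the same RDD
alias.unpersist()                // unpersists the single RDD both names point to
sc.getPersistentRDDs.isEmpty     // true, assuming nothing else is cached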
Source: https://stackoverflow.com/questions/38508577/spark-list-all-cached-rdd-names-and-unpersist