Question:
I am new to Apache Spark. I created several RDDs and DataFrames and cached them, and now I want to unpersist some of them using the command below:
rddName.unpersist()
but I can't remember their names. I used sc.getPersistentRDDs, but the output does not include the names. I also looked at the cached RDDs in the Spark UI through the browser, but again there is no name information. Am I missing something?
Answer 1:
@Dikei's answer is actually correct, but I believe what you are looking for is sc.getPersistentRDDs:
scala> val rdd1 = sc.makeRDD(1 to 100)
# rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:27
scala> val rdd2 = sc.makeRDD(10 to 1000)
# rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:27
scala> rdd2.cache.setName("rdd_2")
# res0: rdd2.type = rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res1: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27)
scala> rdd1.cache.setName("foo")
# res2: rdd1.type = foo ParallelCollectionRDD[0] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27, 0 -> foo ParallelCollectionRDD[0] at makeRDD at <console>:27)
Now let's add another RDD and name it as well:
scala> rdd3.setName("bar")
# res4: rdd3.type = bar ParallelCollectionRDD[2] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res5: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27, 0 -> foo ParallelCollectionRDD[0] at makeRDD at <console>:27)
Notice that rdd3 does not show up in the output: we only called setName on it, never cache, so it isn't actually persisted.
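Building on the session above, if the goal is to unpersist one specific RDD by the name given to it via setName, a minimal sketch could look like this (the name "rdd_2" is just the example name used above):
// look up cached RDDs by the name assigned via setName and unpersist them
sc.getPersistentRDDs
  .values
  .filter(_.name == "rdd_2")
  .foreach(_.unpersist())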
Answer 2:
PySparkers: getPersistentRDDs isn't yet implemented in Python, so unpersist your RDDs by dipping into Java:
for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()
Answer 3:
A generic Scala way of doing this: loop over the SparkContext, get all persistent RDDs, and unpersist each one. I use this at the end of a driver program.
for ((id, rdd) <- sparkSession.sparkContext.getPersistentRDDs) {
  log.info("Unexpected cached RDD " + id)
  rdd.unpersist()
}
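The question also mentions cached DataFrames; those can be dropped through the catalog without going RDD by RDD. A rough sketch, assuming Spark 2.x with a SparkSession named sparkSession (the table name "my_table" is hypothetical):
sparkSession.catalog.clearCache()              // drop every cached DataFrame/table
sparkSession.catalog.uncacheTable("my_table")  // or drop a single cached table by name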
A generic Java way of doing this, where jsc is a JavaSparkContext:
if (jsc != null) {
  Map<Integer, JavaRDD<?>> persistentRDDS = jsc.getPersistentRDDs();
  // iterate over Map.entrySet() with a for-each loop
  for (Map.Entry<Integer, JavaRDD<?>> entry : persistentRDDS.entrySet()) {
    LOG.info("Key = " + entry.getKey()
        + ", unpersisting cached RDD = " + entry.getValue().unpersist());
  }
}
Another short form of unpersisting in Java, without knowing the RDD names:
Map<Integer, JavaRDD<?>> persistentRDDS = jsc.getPersistentRDDs();
persistentRDDS.values().forEach(JavaRDD::unpersist);
Answer 4:
There's no special meaning to the rddName variable. It is just a reference to an RDD. For example, in the following code
val rddName: RDD[Something]
val name2 = rddName
name2 and rddName are two references that point to the same RDD. Calling name2.unpersist is the same as calling rddName.unpersist.
If you want to unpersist an RDD, you have to manually keep a reference to it.
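To make this concrete, here is a small illustrative sketch (the names rdd and alias are made up for the example) showing that either reference unpersists the same underlying RDD:
val rdd = sc.makeRDD(1 to 10).setName("shared").cache()
val alias = rdd                  // a second reference to the same RDD
alias.unpersist()                // unpersists the single RDD both names point to
sc.getPersistentRDDs.isEmpty     // true, assuming nothing else is cached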
Source: https://stackoverflow.com/questions/38508577/spark-list-all-cached-rdd-names-and-unpersist