Is there an API function to display “Fraction Cached” for an RDD?

落爺英雄遲暮 submitted on 2021-02-07 10:59:26


On the Storage tab of the PySparkShell application UI ([server]:8088) I can see information about an RDD I am using. One of the columns is Fraction Cached.

How can I retrieve this percentage programmatically?

I can use getStorageLevel() to get some information about RDD caching but not Fraction Cached.

Do I have to calculate it myself?


SparkContext.getRDDStorageInfo is probably the thing you're looking for. It returns an Array of RDDInfo which provides information about:

  • Memory size.
  • Total number of partitions.
  • Number of cached partitions.

It is not directly exposed in PySpark so you'll have to be a bit creative:

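from operator import truediv

# getRDDStorageInfo is not exposed in PySpark, so call it through the private JVM handle
storage_info = sc._jsc.sc().getRDDStorageInfo()

[{
    "memSize": s.memSize(),
    "numPartitions": s.numPartitions(),
    "numCachedPartitions": s.numCachedPartitions(),
    "fractionCached": truediv(s.numCachedPartitions(), s.numPartitions())
} for s in storage_info]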
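
To get the figure for one particular RDD rather than for every entry, the entries can be matched on the RDD id. This is only a minimal sketch: it assumes each entry exposes id(), numPartitions() and numCachedPartitions() the way the Scala RDDInfo class does, and fraction_cached is a hypothetical helper name, not part of PySpark.

```python
from operator import truediv


def fraction_cached(storage_info, rdd_id):
    """Return cached / total partitions for the entry with the given id.

    Returns None when no entry in storage_info has that id.
    """
    for s in storage_info:
        if s.id() == rdd_id:
            total = s.numPartitions()
            # Guard against division by zero for an RDD with no partitions.
            return truediv(s.numCachedPartitions(), total) if total else 0.0
    return None
```

With a live context this would be called as something like fraction_cached(sc._jsc.sc().getRDDStorageInfo(), rdd.id()). A None result means the RDD has no entry in the storage info.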

If you have access to the REST API you can of course use it directly:

import requests

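# host and port identify the Spark UI; both must be defined beforehand
url = "http://{0}:{1}/api/v1/applications/{2}/storage/rdd/".format(
    host, port, sc.applicationId
)

# Fetch the list of RDDs, then the detail record for each one
[r.json() for r in [
    requests.get("{0}{1}".format(url, rdd.get("id"))) for
    rdd in requests.get(url).json()
] if r.status_code == 200]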
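
The REST responses do not appear to carry a ready-made Fraction Cached value, so it still has to be derived from the partition counts. A rough sketch, assuming each JSON record has the keys "id", "numPartitions" and "numCachedPartitions"; fractions_from_rest is a hypothetical name used only for illustration.

```python
from operator import truediv


def fractions_from_rest(records):
    """Map RDD id -> fraction of partitions cached, from decoded REST JSON.

    records is a list of dicts, e.g. the result of requests.get(url).json().
    Records lacking the assumed keys are skipped rather than guessed at.
    """
    result = {}
    for rec in records:
        total = rec.get("numPartitions")
        cached = rec.get("numCachedPartitions")
        if rec.get("id") is None or total is None or cached is None:
            continue
        # Avoid dividing by zero for an RDD that reports no partitions.
        result[rec["id"]] = truediv(cached, total) if total else 0.0
    return result
```

Passing the list returned by requests.get(url).json() should then give one fraction per RDD id.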

