Is there an API function to display “Fraction Cached” for an RDD?

Submitted by 落爺英雄遲暮 on 2021-02-07 10:59:26

Question


On the Storage tab of the PySparkShell application UI ([server]:8088) I can see information about an RDD I am using. One of the columns is Fraction Cached.

How can I retrieve this percentage programmatically?

I can use getStorageLevel() to get some information about RDD caching but not Fraction Cached.

Do I have to calculate it myself?


Answer 1:


SparkContext.getRDDStorageInfo is probably the thing you're looking for. It returns an Array of RDDInfo which provides information about:

  • Memory size.
  • Total number of partitions.
  • Number of cached partitions.

It is not directly exposed in PySpark so you'll have to be a bit creative:

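from operator import truediv

# getRDDStorageInfo lives on the JVM SparkContext, so reach it through the Py4J gateway.
storage_info = sc._jsc.sc().getRDDStorageInfo()

# One dict per RDDInfo entry; truediv avoids integer division on Python 2.
[{
    "memSize": s.memSize(),
    "numPartitions": s.numPartitions(),
    "numCachedPartitions": s.numCachedPartitions(),
    "fractionCached": truediv(s.numCachedPartitions(), s.numPartitions())
} for s in storage_info]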
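
If you only care about one RDD, the same idea can be wrapped in a small helper. This is a minimal sketch, not part of PySpark: the function names fraction_cached and fraction_cached_by_id are hypothetical, and it assumes each entry exposes id(), numPartitions() and numCachedPartitions() the way the JVM RDDInfo object does. It also guards against a zero partition count, which the one-liner above does not.

```python
def fraction_cached(info):
    """Return cached / total partitions for one RDDInfo-like object."""
    total = info.numPartitions()
    if total == 0:
        # Avoid ZeroDivisionError for an RDD that reports no partitions.
        return 0.0
    return info.numCachedPartitions() / float(total)


def fraction_cached_by_id(storage_info, rdd_id):
    """Return the fraction for the entry whose id() equals rdd_id.

    Returns None when no entry matches, for example when the RDD
    is not listed in the storage info at all.
    """
    for info in storage_info:
        if info.id() == rdd_id:
            return fraction_cached(info)
    return None


# Hypothetical usage inside a PySpark session, where `sc` and `rdd` exist:
# fraction_cached_by_id(sc._jsc.sc().getRDDStorageInfo(), rdd.id())
```

As far as I know, an RDD that has been marked with cache() but not yet materialized by an action may not show up in the list, so treat a None result as "nothing cached yet" rather than an error.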

If you have access to the REST API you can of course use it directly:

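import requests

# host and port are placeholders for the address of the Spark UI.
url = "http://{0}:{1}/api/v1/applications/{2}/storage/rdd/".format(
    host, port, sc.applicationId
)

# List the stored RDDs, then fetch the details of each one by id,
# keeping only the successful responses.
[r.json() for r in [
    requests.get("{0}{1}".format(url, rdd.get("id")))
    for rdd in requests.get(url).json()
] if r.status_code == 200]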
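
The REST responses do not carry a ready-made fraction either, so you still compute it yourself. Here is a rough sketch that assumes each JSON object contains numPartitions and numCachedPartitions keys; that matches what I have seen from the storage endpoint, but check it against the output of your Spark version. The function name fraction_cached_from_json is made up for illustration.

```python
def fraction_cached_from_json(rdd_json):
    """Compute the cached fraction from one decoded storage/rdd JSON object.

    Assumes the keys "numPartitions" and "numCachedPartitions";
    returns None if either key is missing.
    """
    total = rdd_json.get("numPartitions")
    cached = rdd_json.get("numCachedPartitions")
    if total is None or cached is None:
        return None
    if total == 0:
        # Avoid dividing by zero for an entry with no partitions.
        return 0.0
    return cached / float(total)


# Hypothetical usage with the `url` built above:
# [fraction_cached_from_json(rdd) for rdd in requests.get(url).json()]
```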


Source: https://stackoverflow.com/questions/42003533/is-there-an-api-function-to-display-fraction-cached-for-an-rdd
