I have an RDD[Row] which needs to be persisted to a third-party repository.
But this third-party repository accepts a maximum of 5 MB in a single call.
One straightforward way is to call one of the following, depending on whether you want to store your data in serialized form or not. Then go to the Spark UI "Storage" page, where you should be able to see the total size of the RDD (memory + disk):
rdd.persist(StorageLevel.MEMORY_AND_DISK)
or
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
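If you want to read those same numbers programmatically instead of from the UI, a minimal sketch could look like the following. It assumes `rdd` and `sc` (your SparkContext) are in scope; note that getRDDStorageInfo is marked as a DeveloperApi, so its shape may differ across Spark versions:

import org.apache.spark.storage.StorageLevel

// Cache the RDD, then force materialization with an action so the
// storage info actually has something to report.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
rdd.count()

// getRDDStorageInfo reports the same numbers shown on the
// Spark UI "Storage" page.
sc.getRDDStorageInfo
  .find(_.id == rdd.id)
  .foreach { info =>
    println(s"memory: ${info.memSize} bytes, disk: ${info.diskSize} bytes")
  }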
It is not easy to calculate an accurate memory size at runtime. You may try to do an estimation at runtime, though, based on data sampled offline: say X rows took Y GB offline, then Z rows at runtime may take roughly Z*Y/X GB. This is similar to what Justin suggested earlier.
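As a rough sketch of that ratio-based estimate, where offlineRows (X) and offlineBytes (Y) are hypothetical numbers you measured offline and 5 MB is the third-party limit:

// X and Y below are placeholder values; substitute your own
// offline measurements.
val offlineRows: Long  = 1000000L                  // X, rows measured offline
val offlineBytes: Long = 2L * 1024 * 1024 * 1024   // Y, bytes measured offline (2 GB)
val bytesPerRow = offlineBytes.toDouble / offlineRows

val runtimeRows = rdd.count()                      // Z, rows at runtime
val estimatedBytes = runtimeRows * bytesPerRow     // ~ Z*Y/X

// Derive how many rows fit under the 5 MB per-call limit.
val maxBytesPerCall = 5L * 1024 * 1024
val rowsPerCall = math.max(1L, (maxBytesPerCall / bytesPerRow).toLong)
println(s"~$estimatedBytes bytes total; send about $rowsPerCall rows per call")

This only estimates an average row size, so if your row sizes vary a lot you may want to leave some headroom below the 5 MB limit.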
Hope this helps.