How can I find the size of an RDD

失恋的感觉 2020-12-04 19:45

I have an RDD[Row] which needs to be persisted to a third-party repository, but this third-party repository accepts a maximum of 5 MB in a single call.

So I want to create partitions based on the size of the data present in the RDD, not the number of rows. How can I find the size of an RDD and create partitions based on it?

5 Answers
  •  长情又很酷
    2020-12-04 20:15

    One straightforward way is to call one of the following, depending on whether you want to store your data in serialized form or not, then go to the Spark UI "Storage" page, where you should be able to see the total size of the RDD (memory + disk):

    import org.apache.spark.storage.StorageLevel

    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    
    or
    
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
    
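    If you would rather read the same numbers programmatically instead of from the UI, Spark's SparkContext.getRDDStorageInfo (a developer API) exposes the memory and disk footprint of each persisted RDD. A minimal sketch, assuming sc is your SparkContext and rdd is the RDD in question:

    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count() // force materialization so the storage stats are populated

    // find this RDD's entry among all persisted RDDs and print its footprint
    sc.getRDDStorageInfo.find(_.id == rdd.id).foreach { info =>
      println(s"memory: ${info.memSize} bytes, disk: ${info.diskSize} bytes")
    }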

    It is not easy to compute an accurate memory size at runtime. You can try to make an estimate at runtime, though, based on data sampled offline: say X rows took Y GB offline, then Z rows at runtime may take roughly Z*Y/X GB. This is similar to what Justin suggested earlier; a sketch of the idea follows below.
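    One way to turn that rule of thumb into code is Spark's built-in org.apache.spark.util.SizeEstimator, applied to a small sample and extrapolated to the full row count. A minimal sketch (the helper name estimateRddBytes is my own invention; also note SizeEstimator measures in-memory JVM size, which can differ from the serialized size your 5 MB limit likely counts against):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.util.SizeEstimator

    // Extrapolate total size: X sampled rows occupy Y bytes, so Z rows take about Z * Y / X.
    def estimateRddBytes(rdd: RDD[Row], sampleSize: Int = 1000): Long = {
      val sample = rdd.take(sampleSize)                  // X rows (or fewer)
      if (sample.isEmpty) 0L
      else {
        val sampleBytes = SizeEstimator.estimate(sample) // Y: in-memory size of the sample
        val totalRows = rdd.count()                      // Z: total number of rows
        totalRows * sampleBytes / sample.length          // ≈ Z * Y / X
      }
    }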

    Hope this helps.
