How can I estimate the size in bytes of each column in a Spark DataFrame?

Submitted on 2021-02-08 08:16:03

Question


I have a very large Spark DataFrame with a number of columns, and I want to make an informed judgement about whether or not to keep them in my pipeline, in part based on how big they are. By "how big," I mean the size in bytes in RAM when this DataFrame is cached, which I expect to be a decent estimate for the computational cost of processing this data. Some columns are simple types (e.g. doubles, integers) but others are complex types (e.g. arrays and maps of variable length).

An approach I have tried is to cache the DataFrame without the column in question and then with it, check the Storage tab in the Spark UI, and take the difference. But this is a slow, tedious exercise for a DataFrame with a lot of columns.

I typically use PySpark so a PySpark answer would be preferable, but Scala would be fine as well.


Answer 1:


I found a solution which builds off of this related answer: https://stackoverflow.com/a/49529028.

Assuming I'm working with a dataframe called df and a SparkSession object called spark:

import org.apache.spark.sql.{functions => F}

// force the full dataframe into memory (could specify persistence
// mechanism here to ensure that it's really being cached in RAM)
df.cache()
df.count()

// calculate size of full dataframe
val catalystPlan = df.queryExecution.logical
val dfSizeBytes = spark.sessionState.executePlan(catalystPlan).optimizedPlan.stats.sizeInBytes

for (col <- df.columns) {
    println("Working on " + col)

    // select all columns except this one:
    val subDf = df.select(df.columns.filter(_ != col).map(F.col): _*)

    // force subDf into RAM
    subDf.cache()
    subDf.count()

    // calculate size of subDf
    val catalystPlan = subDf.queryExecution.logical
    val subDfSizeBytes = spark.sessionState.executePlan(catalystPlan).optimizedPlan.stats.sizeInBytes

    // size of this column as a fraction of full dataframe
    val colSizeFrac = (dfSizeBytes - subDfSizeBytes).toDouble / dfSizeBytes.toDouble
    println("Column space fraction is " + colSizeFrac * 100.0 + "%")
    subDf.unpersist()
}
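
Since the question mentions a preference for PySpark, here is a rough Python port of the same idea, following the JVM-internals approach from the linked answer (https://stackoverflow.com/a/49529028). It reaches into private attributes (_jdf, _jsparkSession), so treat it as a sketch that may need adjustment for your Spark version; df and spark are assumed to be the same DataFrame and SparkSession as above.

def cached_size_bytes(frame):
    """Estimate the cached size of a DataFrame from Catalyst statistics.

    Relies on internal/JVM-only handles (_jdf, _jsparkSession), so the exact
    calls may differ between Spark versions.
    """
    frame.cache()
    frame.count()  # force materialization so the statistics reflect RAM usage
    catalyst_plan = frame._jdf.queryExecution().logical()
    optimized = (spark._jsparkSession.sessionState()
                      .executePlan(catalyst_plan)
                      .optimizedPlan())
    # sizeInBytes is a Scala BigInt; go through its string form to get a Python int
    return int(optimized.stats().sizeInBytes().toString())

df_size_bytes = cached_size_bytes(df)

for column in df.columns:
    print("Working on " + column)

    # select all columns except this one and measure the cached size
    sub_df = df.select([c for c in df.columns if c != column])
    sub_size_bytes = cached_size_bytes(sub_df)
    sub_df.unpersist()

    # size of this column as a fraction of the full dataframe
    col_size_frac = (df_size_bytes - sub_size_bytes) / df_size_bytes
    print("Column space fraction is " + str(col_size_frac * 100.0) + "%")

As in the Scala version, each sub-DataFrame is unpersisted after measuring it so the cache does not fill up over the loop.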

Some confirmations that this approach gives sensible results:

  1. The reported column sizes add up to 100%.
  2. Simple type columns like integers or doubles take up the expected 4 bytes or 8 bytes per row.


Source: https://stackoverflow.com/questions/54870668/how-can-i-estimate-the-size-in-bytes-of-each-column-in-a-spark-dataframe
