How to get the size of a data frame before doing the broadcast join in pyspark

Submitted by 时光总嘲笑我的痴心妄想 on 2021-02-08 09:14:02

Question


I am new to Spark. I want to do a broadcast join, and before that I am trying to get the size of the data frame that I want to broadcast.

Is there any way to find the size of a data frame?

I am using Python as my programming language for Spark.

Any help is much appreciated.


Answer 1:


If you are looking for the size in bytes as well as the row count, try one of the following alternatives (the examples are in Scala, but each can be reproduced from PySpark):

Alternative-1

 // ### Alternative -1
    /**
      * file content
      * spark-test-data.json
      * --------------------
      * {"id":1,"name":"abc1"}
      * {"id":2,"name":"abc2"}
      * {"id":3,"name":"abc3"}
      */
    val fileName = "spark-test-data.json"
    val path = getClass.getResource("/" + fileName).getPath

    spark.catalog.createTable("df", path, "json")
      .show(false)

    /**
      * +---+----+
      * |id |name|
      * +---+----+
      * |1  |abc1|
      * |2  |abc2|
      * |3  |abc3|
      * +---+----+
      */
    // Collect only statistics that do not require scanning the whole table (that is, size in bytes).
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
    spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

    /**
      * +----------+---------+-------+
      * |col_name  |data_type|comment|
      * +----------+---------+-------+
      * |Statistics|68 bytes |       |
      * +----------+---------+-------+
      */
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
    spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

    /**
      * +----------+----------------+-------+
      * |col_name  |data_type       |comment|
      * +----------+----------------+-------+
      * |Statistics|68 bytes, 3 rows|       |
      * +----------+----------------+-------+
      */

Alternative-2


    // ### Alternative 2

    val df = spark.range(10)
    df.createOrReplaceTempView("myView")
    spark.sql("explain cost select * from myView").show(false)

    /**
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |plan                                                                                                                                                                    |
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |== Optimized Logical Plan ==
      * Range (0, 10, step=1, splits=Some(2)), Statistics(sizeInBytes=80.0 B, hints=none)
      *
      * == Physical Plan ==
      * *(1) Range (0, 10, step=1, splits=2)|
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      */

Alternative-3

    // ### Alternative 3
    println(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytes)
    // 80


来源:https://stackoverflow.com/questions/62461550/how-to-get-the-size-of-a-data-frame-before-doing-the-broadcast-join-in-pyspark
