Limiting maximum size of dataframe partition

日久生厌, 2021-02-13 22:54

When I write out a dataframe to, say, csv, a .csv file is created for each partition. Suppose I want to limit the max size of each file to, say, 1 MB. I could do the write multiple times, increasing the repartition argument each time until every file fits, but is there a way to calculate up front how many partitions are needed so that no file exceeds the limit?

2 answers
刺人心 (OP), 2021-02-13 23:42

    val df = spark.range(10000000)
    df.cache  // cache so the re-planned query resolves to the in-memory relation
    val catalyst_plan = df.queryExecution.logical
    // Re-plan the logical plan; the optimizer attaches a size estimate to its statistics
    val df_size_in_bytes = spark.sessionState
      .executePlan(catalyst_plan)
      .optimizedPlan.stats.sizeInBytes

    df_size_in_bytes: BigInt = 80000000
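
With the total size known, the repartition argument for a given file-size cap is a ceiling division away. A minimal sketch, reusing df and df_size_in_bytes from above and assuming the question's 1 MB cap; the output path is a placeholder, and the in-memory estimate only approximates what the csv files will occupy on disk:

    // Cap each output file at roughly 1 MB (the limit from the question).
    val maxBytesPerFile = BigInt(1024 * 1024)

    // Ceiling division: enough partitions that none should exceed the cap,
    // assuming rows spread evenly after the shuffle.
    val numPartitions = ((df_size_in_bytes + maxBytesPerFile - 1) / maxBytesPerFile).toInt

    // One .csv file is written per partition; "/tmp/out" is a placeholder path.
    df.repartition(numPartitions)
      .write
      .csv("/tmp/out")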

    In practice, the cleanest approach is to take a small sample (say 100 records), estimate its size with the technique above, and extrapolate to the full row count, as in the sketch below.
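
A sketch of that sampling idea, in the same spirit as the trick above; the helper name estimateTotalSizeInBytes and the 100-row default are illustrative, not part of any Spark API:

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Hypothetical helper: measure a small cached sample, then scale
    // linearly to the full row count.
    def estimateTotalSizeInBytes(spark: SparkSession,
                                 df: Dataset[_],
                                 sampleSize: Int = 100): BigInt = {
      val totalRows = df.count()  // one full pass for the row count
      val n = math.min(sampleSize.toLong, totalRows)
      if (n == 0L) BigInt(0)
      else {
        val sample = df.limit(n.toInt)
        sample.cache()  // cache + count so the plan statistics reflect real data
        sample.count()
        val sampleBytes = spark.sessionState
          .executePlan(sample.queryExecution.logical)
          .optimizedPlan.stats.sizeInBytes
        sample.unpersist()
        sampleBytes * totalRows / n  // linear extrapolation to all rows
      }
    }

This avoids caching the whole dataframe at the cost of one extra pass for the count, and the estimate feeds straight into the partition arithmetic shown earlier.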
