When I write out a DataFrame to, say, CSV, a .csv file is created for each partition. Suppose I want to limit the max size of each file to, say, 1 MB. I could do the write multiple times, adjusting the partition count by trial and error, but is there a way to estimate the DataFrame's size up front and pick the number of partitions accordingly?
val df = spark.range(10000000)
df.cache()  // cache so Catalyst reports statistics for the in-memory data

// Ask Catalyst for the optimized plan's size estimate, in bytes
val catalyst_plan = df.queryExecution.logical
val df_size_in_bytes =
  spark.sessionState.executePlan(catalyst_plan).optimizedPlan.stats.sizeInBytes

// df_size_in_bytes: BigInt = 80000000
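To turn that estimate into a file-size cap, you can derive a partition count from it and repartition before writing. The snippet below is a rough sketch rather than a guarantee: the 1 MB target, the output path, and the variable names are illustrative, and the Catalyst figure measures in-memory size, so the resulting CSV files will only be approximately bounded.

// Sketch: derive a partition count so each output file stays roughly
// under maxFileSizeBytes (names and the 1 MB target are illustrative)
val maxFileSizeBytes = 1L * 1024 * 1024
val numPartitions =
  math.max(1, math.ceil(df_size_in_bytes.toDouble / maxFileSizeBytes).toInt)

df.repartition(numPartitions)
  .write
  .option("header", "true")
  .csv("/tmp/limited_size_output")   // hypothetical output path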
For large DataFrames where estimating the whole dataset directly is too expensive, a practical shortcut is to take, say, 100 records, estimate their size the same way, and extrapolate that per-row size to the full row count.
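A minimal sketch of that sampling idea, assuming a spark-shell session, is below. The estimateSizeInBytes helper, the sample size of 100, and the extrapolation are my own illustrative choices; the result is only as good as how representative the sampled rows are of the whole DataFrame.

import org.apache.spark.sql.DataFrame

// Estimate the in-memory size of a (small) DataFrame via Catalyst statistics
def estimateSizeInBytes(sampleDf: DataFrame): BigInt = {
  sampleDf.cache().foreach(_ => ())   // materialize so stats reflect real data
  val plan = sampleDf.queryExecution.logical
  sampleDf.sparkSession.sessionState.executePlan(plan).optimizedPlan.stats.sizeInBytes
}

val sampleSize  = 100
val sampleBytes = estimateSizeInBytes(df.limit(sampleSize))
val totalRows   = df.count()

// Scale the per-row size up to the full DataFrame
val estimatedTotalBytes = sampleBytes * totalRows / sampleSize

The estimatedTotalBytes value can then be plugged into the same repartition calculation shown above to pick a partition count before writing.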