When I write out a DataFrame to, say, CSV, a .csv file is created for each partition. Suppose I want to limit the max size of each file to, say, 1 MB. I could do the write multiple times, adjusting the partition count by trial and error, but is there a way to estimate the DataFrame's size up front and pick the number of partitions accordingly?
val df = spark.range(10000000)
df.cache()  // cache so Catalyst reports statistics for the in-memory data

// Ask Catalyst for the optimized plan's size estimate, in bytes
val catalyst_plan = df.queryExecution.logical
val df_size_in_bytes =
  spark.sessionState.executePlan(catalyst_plan).optimizedPlan.stats.sizeInBytes

// df_size_in_bytes: BigInt = 80000000
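To turn that estimate into a file-size cap, you can derive a partition count from it and repartition before writing. The snippet below is a rough sketch rather than a guarantee: the 1 MB target, the output path, and the variable names are illustrative, and the Catalyst figure measures in-memory size, so the resulting CSV files will only be approximately bounded.

// Sketch: derive a partition count so each output file stays roughly
// under maxFileSizeBytes (names and the 1 MB target are illustrative)
val maxFileSizeBytes = 1L * 1024 * 1024
val numPartitions =
  math.max(1, math.ceil(df_size_in_bytes.toDouble / maxFileSizeBytes).toInt)

df.repartition(numPartitions)
  .write
  .option("header", "true")
  .csv("/tmp/limited_size_output")   // hypothetical output path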
For large DataFrames where estimating the whole dataset directly is too expensive, a practical shortcut is to take, say, 100 records, estimate their size the same way, and extrapolate that per-row size to the full row count.
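A minimal sketch of that sampling idea, assuming a spark-shell session, is below. The estimateSizeInBytes helper, the sample size of 100, and the extrapolation are my own illustrative choices; the result is only as good as how representative the sampled rows are of the whole DataFrame.

import org.apache.spark.sql.DataFrame

// Estimate the in-memory size of a (small) DataFrame via Catalyst statistics
def estimateSizeInBytes(sampleDf: DataFrame): BigInt = {
  sampleDf.cache().foreach(_ => ())   // materialize so stats reflect real data
  val plan = sampleDf.queryExecution.logical
  sampleDf.sparkSession.sessionState.executePlan(plan).optimizedPlan.stats.sizeInBytes
}

val sampleSize  = 100
val sampleBytes = estimateSizeInBytes(df.limit(sampleSize))
val totalRows   = df.count()

// Scale the per-row size up to the full DataFrame
val estimatedTotalBytes = sampleBytes * totalRows / sampleSize

The estimatedTotalBytes value can then be plugged into the same repartition calculation shown above to pick a partition count before writing.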