Scala: How do I split a DataFrame into multiple CSV files based on number of rows

Submitted by 风格不统一 on 2019-12-25 08:40:04

Question


I have a DataFrame, say df1, with 10M rows. I want to split it into multiple CSV files with 1M rows each. Any suggestions on how to do this in Scala?


Answer 1:


You can use the randomSplit method on DataFrames.

import scala.util.Random
import spark.implicits._ // needed for toDF on a local collection

val df = List(0,1,2,3,4,5,6,7,8,9).toDF
// randomSplit takes weights as Array[Double]; equal weights give roughly equal parts
val splitted = df.randomSplit(Array(1.0, 1.0, 1.0, 1.0, 1.0))
splitted.foreach { a => a.write.format("csv").save("path" + Random.nextInt) }

I used Random.nextInt to get a unique name. You can add other logic there if necessary. Note that randomSplit splits by random sampling against the weights, so the parts will only be approximately equal in size, not exactly 1M rows each.

Sources:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

How to save a spark DataFrame as csv on disk?

https://forums.databricks.com/questions/8723/how-can-i-split-a-spark-dataframe-into-n-equal-dat.html
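If the goal is simply a fixed maximum number of rows per output file, Spark 2.2+ also offers the writer option maxRecordsPerFile, which lets the writer do the chunking itself. A minimal sketch (the output path and header setting are illustrative assumptions, not from the original answer):

```scala
// Assumes a running SparkSession and an existing DataFrame `df1`.
// maxRecordsPerFile caps the rows written per part-file, so a 10M-row
// DataFrame with a 1M cap yields about 10 CSV files.
df1.repartition(1) // single partition so the cap alone controls file count
  .write
  .option("maxRecordsPerFile", 1000000)
  .option("header", "true")
  .format("csv")
  .save("/tmp/df1_chunks") // hypothetical output path
```

This avoids materializing intermediate DataFrames entirely, at the cost of funneling the write through one partition; dropping the repartition keeps parallelism but each partition is capped independently.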

Edit: An alternative approach is to use limit and except:

var input = List(1,2,3,4,5,6,7,8,9).toDF
val limit = 2

var newFrames = List[org.apache.spark.sql.DataFrame]()
var size = input.count

while (size > 0) {
  newFrames = input.limit(limit) :: newFrames
  input = input.except(newFrames.head) // except is a set difference, so it also drops duplicates
  size = size - limit
}

newFrames.foreach(_.show)

The first element in the resulting list may contain fewer rows than the rest. Also beware that except is a set operation (EXCEPT DISTINCT), so this approach misbehaves on DataFrames containing duplicate rows.
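For exact fixed-size chunks without the repeated except scans, one common pattern (a sketch of my own, not from the original answers) is to assign each row a bucket id derived from a row number, then write each bucket separately:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, lit}
import spark.implicits._ // assumes a SparkSession named `spark`

// `df1` is the input DataFrame; rows per output chunk:
val limit = 1000000

// A window with no partitioning moves all rows through one partition;
// fine for moderate sizes, a bottleneck for very large data.
val w = Window.orderBy(lit(1))
val withBucket = df1
  .withColumn("bucket", ((row_number().over(w) - 1) / limit).cast("int"))

val buckets = withBucket.select("bucket").distinct.collect.map(_.getInt(0))
buckets.foreach { b =>
  withBucket.filter($"bucket" === b)
    .drop("bucket")
    .write.format("csv")
    .save(s"/tmp/out/part_$b") // hypothetical output path
}
```

Unlike randomSplit, every chunk except possibly the last has exactly `limit` rows, and duplicate rows are preserved.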



Source: https://stackoverflow.com/questions/43567164/scala-how-so-i-split-dataframe-to-multiple-csv-files-based-on-number-of-rows
