How to split a Spark DataFrame into equal-sized parts


Question


I am using df.randomSplit(), but it does not split the DataFrame into equal numbers of rows. Is there any other way I can achieve this?


Answer 1:


In my case I needed balanced (equal-sized) partitions in order to perform a specific cross-validation experiment.

For that, you usually:

  1. Randomize the dataset
  2. Apply modulus operation to assign each element to a fold (partition)

After this step you will have to extract each partition using filter; as far as I know, there is still no transformation that splits a single RDD into several.

Here is some code in Scala. It only uses standard Spark operations, so it should be easy to adapt to Python; a PySpark sketch follows the Scala version below:

// Assumed names for illustration: `data` is an RDD[Array[Double]] of
// instances, and `m_classIndex` is the position of the class label.
val npartitions = 3
val seed = 42L
val m_classIndex = 0

val foldedRDD = data
   // Pair each instance with a deterministic pseudo-random number
   .zipWithIndex
   .map { case (instance, idx) => (instance, new scala.util.Random(idx * seed).nextInt()) }
   // Random ordering, stratified by class label
   .sortBy { case (instance, rnd) => (instance(m_classIndex), rnd) }
   // Assign each instance to a fold via the modulus of its new index
   .zipWithIndex
   .map { case ((instance, _), idx) => (instance, idx % npartitions) }

// Extract each fold with a filter
val balancedRDDList =
    for (f <- 0 until npartitions)
    yield foldedRDD.filter { case (_, fold) => fold == f }
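
Since the question asks about a DataFrame, here is a minimal PySpark sketch of the same shuffle-then-modulus idea, without the class stratification. The names df and n_splits are assumptions, not part of the original answer, and a window with no partitionBy pulls every row into a single partition, so this only suits modest data sizes:

# Assumed names for illustration: `df` is the input DataFrame and
# `n_splits` the number of equal-sized pieces.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).toDF("value")  # stand-in for the real DataFrame
n_splits = 3

# 1. Randomize: materialize a seeded random column first, because rand()
#    cannot be used directly as a window ordering expression.
shuffled = df.withColumn("rnd", F.rand(seed=42))

# 2. Modulus: number the rows in the random order and take index % n_splits.
#    Note: a window without partitionBy moves all rows to one partition.
w = Window.orderBy("rnd")
indexed = shuffled.withColumn("fold", (F.row_number().over(w) - 1) % n_splits)

# Extract each split with a filter, as in the RDD version.
splits = [indexed.filter(F.col("fold") == f).drop("rnd", "fold")
          for f in range(n_splits)]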


Source: https://stackoverflow.com/questions/41223125/how-to-split-a-spark-dataframe-with-equal-records
