How to do Stratified sampling with Spark DataFrames? [duplicate]

梦想与她 提交于 2019-11-30 18:52:58

问题


I'm in Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey(), sampleByKeyExact(). I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157). That's targeted for Spark 1.5, till that comes through, whats the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames. Thanks & Regards MK


回答1:


Spark 1.1 added stratified sampling routines SampleByKey and SampleByKeyExact to Spark Core, so since then they are available without MLLib dependencies.

These two functions are PairRDDFunctions and belong to key-value RDD[(K,T)]. Also DataFrames do not have keys. You'd have to use underlying RDD - something like below:

val df = ... // your dataframe
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key

val sample = df.rdd.keyBy(x=>x(0)).sampleByKey(false, fractions)

Note that sample is RDD not DataFrame now, but you can easily convert it back to DataFrame since you already have schema defined for df.



来源:https://stackoverflow.com/questions/30193800/how-to-do-stratified-sampling-with-spark-dataframes

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!