Spark: Is the sample method on DataFrames uniform sampling?

Submitted by 我怕爱的太早我们不能终老 on 2019-11-29 01:09:31

Question


I want to randomly select a given number of rows from a DataFrame. I know the sample method does this, but I need the sampling to be uniform. Is the sample method on Spark DataFrames uniform or not?

Thanks


Answer 1:


There are a few code paths here:

  • If withReplacement = false && fraction > .4, it uses a souped-up random number generator (rng.nextDouble() <= fraction) and lets that do the work. This seems like it would be pretty uniform.
  • If withReplacement = false && fraction <= .4, it uses a more complex algorithm (GapSamplingIterator). At a glance, it looks like it should be uniform as well.
  • If withReplacement = true, it does close to the same thing, except that, by the looks of it, the same row can be selected more than once, so this looks to me like it would not be as uniform as the first two.
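
The first two code paths can be illustrated with a minimal sketch in plain Scala. This is not Spark's implementation: the names SamplingSketch, bernoulliSample and gapSample are hypothetical, and scala.util.Random stands in for whatever generator Spark uses. It only shows why both approaches should give every row the same chance of being kept: the first tests each row independently, the second draws the number of rows to skip from a geometric distribution, which is the distribution of gaps that the per-row test produces.

```scala
import scala.util.Random

// Illustrative only: hypothetical helpers, not Spark's own code.
object SamplingSketch {

  // Per-row test: keep each row independently with probability `fraction`.
  def bernoulliSample[T](rows: IndexedSeq[T], fraction: Double, rng: Random): Vector[T] =
    rows.filter(_ => rng.nextDouble() <= fraction).toVector

  // Gap sampling: instead of testing every row, draw how many rows to skip
  // before the next kept row. The skip count k has probability
  // (1 - fraction)^k * fraction, i.e. it is geometrically distributed.
  def gapSample[T](rows: IndexedSeq[T], fraction: Double, rng: Random): Vector[T] = {
    require(fraction > 0.0 && fraction < 1.0, "fraction must be in (0, 1)")
    val lnq = math.log1p(-fraction) // ln(1 - fraction), always negative here

    def nextSkip(): Long = {
      val u = math.max(rng.nextDouble(), 1e-12) // guard against log(0)
      (math.log(u) / lnq).toLong
    }

    val out = Vector.newBuilder[T]
    var i = nextSkip() // index of the first kept row
    while (i < rows.length) {
      out += rows(i.toInt)
      i += 1 + nextSkip() // move past the kept row, then skip again
    }
    out.result()
  }
}
```

With 100,000 rows and fraction = 0.1, both functions should return roughly 10,000 rows, in their original order and without duplicates. Gap sampling makes far fewer calls to the random number generator when the fraction is small, which is presumably why it is used only below a threshold.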



Answer 2:


Yes, it is uniform. For more information, you can try the code below. I hope this clarifies things.

I think this should do the trick, where "data" is your DataFrame:

    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))



Source: https://stackoverflow.com/questions/31633117/spark-is-sample-method-on-dataframes-uniform-sampling
