How do simple random sampling and dataframe SAMPLE function work in Apache Spark (Scala)?

说谎 2020-12-06 21:28

Q1. I am trying to get a simple random sample out of a Spark dataframe (13 rows) using the sample function with the parameters withReplacement: false, fraction: 0.6, but it gives me samples of different sizes every time I run it, though it works fine when I set the third parameter (seed). Why so?
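The behaviour in the question can be reproduced without a Spark cluster, since sample(withReplacement = false, fraction) is essentially a per-row Bernoulli coin flip. The sketch below is a plain-Scala stand-in (bernoulliSample is a hypothetical helper, not a Spark API) that keeps each of 13 rows independently with probability 0.6, so the returned size is itself a random quantity:

```scala
import scala.util.Random

// Hypothetical stand-in for Spark's Bernoulli row sampling: keep each
// element independently with probability `fraction`.
def bernoulliSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
  items.filter(_ => rng.nextDouble() <= fraction)

val rows = (1 to 13).toSeq  // same size as the 13-row dataframe in the question

// Each run draws fresh random numbers, so the sample size varies from run
// to run; only the *expected* size is fraction * 13 ≈ 8.
val sizes = (1 to 5).map(_ => bernoulliSample(rows, 0.6, new Random()).size)
```

Running this a few times typically prints a different vector of sizes each time, which is exactly what the question observes with the dataframe sample call.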

2 Answers
  • 2020-12-06 22:03

    The RDD API includes takeSample, which returns a "sample of specified size in an array". It works by calling sample repeatedly until it gets a sample larger than the requested size, then randomly takes the requested number of elements from that oversample. The code comments note that it should rarely need to iterate, because the initial fraction is deliberately biased toward larger sample sizes.
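    The oversample-then-trim idea can be sketched in plain Scala. This is a simplified illustration of the logic described above, not Spark's actual takeSample implementation (the 1.5 oversampling factor is an arbitrary choice for the sketch):

```scala
import scala.util.Random

// Simplified sketch of the takeSample idea: oversample with a Bernoulli
// filter until at least `num` elements come back, then shuffle and keep
// exactly `num` of them.
def takeSampleSketch[T](items: Seq[T], num: Int, rng: Random): Seq[T] = {
  require(num <= items.size)
  // Bias the fraction upward so a single pass usually suffices.
  val fraction = math.min(1.0, num.toDouble / items.size * 1.5)
  var sample = Seq.empty[T]
  while (sample.size < num)
    sample = items.filter(_ => rng.nextDouble() <= fraction)
  rng.shuffle(sample).take(num)
}

val picked = takeSampleSketch((1 to 100).toSeq, 10, new Random(7L))
```

    Because the loop only retries when the oversample came back too small, and the fraction is biased above num / items.size, the retry is rare, matching the code comment the answer mentions.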

  • 2020-12-06 22:09

    How is the sample obtained after random number generation?

    Depending on the fraction you want to sample, there are two different algorithms. You can check Justin Pihony's answer to Spark: Is sample method on Dataframes uniform sampling?

    it gives me samples of different sizes every time I run it, though it works fine when I set the third parameter (seed). Why so?

    If fraction is above RandomSampler.defaultMaxGapSamplingFraction, sampling is done by a simple filter:

    items.filter { _ => rng.nextDouble() <= fraction }
    

    otherwise, simplifying things a little bit, it repeatedly calls the drop method with random integers and takes the next item (gap sampling).
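    A rough sketch of that gap-sampling idea in plain Scala (the real implementation is Spark's GapSamplingIterator; this simplified version is only meant to show the mechanism): instead of testing every element, draw a geometrically distributed "gap", skip that many elements, and keep the next one. The gap length floor(log(u) / log(1 - fraction)) is the number of consecutive Bernoulli failures before a success, so the result is distributed like the element-by-element filter.

```scala
import scala.util.Random

// Simplified gap sampling: skip a geometrically distributed number of
// elements, then take the next one.
def gapSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] = {
  val lnQ = math.log1p(-fraction)  // log(1 - fraction), the failure probability
  val out = Seq.newBuilder[T]
  var i = 0
  while (i < items.size) {
    val gap = (math.log(rng.nextDouble()) / lnQ).toInt  // elements to drop
    i += gap
    if (i < items.size) { out += items(i); i += 1 }
  }
  out.result()
}

val s = gapSample((1 to 1000).toSeq, 0.05, new Random(1L))
```

    The payoff is that for small fractions the sampler touches only the kept elements plus one random draw per gap, rather than one draw per element.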

    Keeping that in mind, it should be obvious that the number of returned elements is random, with mean (assuming there is nothing wrong with GapSamplingIterator) equal to fraction * rdd.count. If you set the seed, you get the same sequence of random numbers and, as a consequence, the same elements are included in the sample.
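    Both claims can be checked numerically in plain Scala, again using a Bernoulli filter as a stand-in for Spark's sampler (not Spark's actual code): the mean sample size over many trials hovers around fraction * count, and a fixed seed reproduces the exact same sample.

```scala
import scala.util.Random

val n = 1000
val fraction = 0.6
val data = (1 to n).toSeq

// Mean size over many unseeded trials fluctuates around fraction * n = 600.
val trials = 200
val meanSize = (1 to trials)
  .map(_ => data.count(_ => Random.nextDouble() <= fraction))
  .sum.toDouble / trials

// With the same seed the random sequence is identical, so the exact same
// elements are selected both times.
val seeded1 = { val r = new Random(123L); data.filter(_ => r.nextDouble() <= fraction) }
val seeded2 = { val r = new Random(123L); data.filter(_ => r.nextDouble() <= fraction) }
```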
