Question
I am trying out the "sample" method of RDD on Spark 1.6.1:
scala> val nu = sc.parallelize(1 to 10)
scala> val sp = nu.sample(true, 0.2)
scala> sp.collect.foreach(println(_))
3
8
scala> val sp2 = nu.sample(true, 0.2)
scala> sp2.collect.foreach(println(_))
2
4
7
8
10
I cannot understand why sp2 contains 2,4,7,8,10. I think there should be only two numbers printed. Is there anything wrong?
Answer 1:
Elaborating on the previous answer: in the documentation (scroll down to sample) it is mentioned (emphasis mine):
fraction: expected size of the sample as a fraction of this RDD's size
  without replacement: probability that each element is chosen; fraction must be [0, 1]
  with replacement: expected number of times each element is chosen; fraction must be >= 0
'Expected' can have several meanings depending on the context, but one meaning it certainly does not have is 'exact', which is why the actual sample size varies from run to run.
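To make the "expected, not exact" point concrete, here is a minimal plain-Scala sketch (no Spark required) that mimics the documented semantics of fraction. The object SampleSizeSketch and its helper names are hypothetical, made up for illustration. Drawing each element's count from a Poisson distribution is an assumption about one way to realise "expected number of times each element is chosen"; it is not a copy of Spark's code.

```scala
import scala.util.Random

object SampleSizeSketch {
  // Draw from a Poisson distribution with the given mean (Knuth's method).
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // With replacement (assumed model): each element is emitted a
  // Poisson(fraction)-distributed number of times, so it may appear 0, 1, 2, ... times.
  def sampleWithReplacement[T](data: Seq[T], fraction: Double, rng: Random): Seq[T] =
    data.flatMap(x => Seq.fill(poisson(fraction, rng))(x))

  // Without replacement: each element is kept independently with probability `fraction`.
  def sampleWithoutReplacement[T](data: Seq[T], fraction: Double, rng: Random): Seq[T] =
    data.filter(_ => rng.nextDouble() < fraction)

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    val data = 1 to 10
    // Repeat the experiment many times and look at the spread of sample sizes.
    val sizes = Seq.fill(10000)(sampleWithReplacement(data, 0.2, rng).size)
    val mean = sizes.sum.toDouble / sizes.size
    println(s"min=${sizes.min} max=${sizes.max} mean=$mean")
  }
}
```

Over many runs the mean size should sit near 10 * 0.2 = 2, while an individual run can return 0, 2, 5 or more elements, which is consistent with what the question observed.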
If you want absolutely fixed sample sizes, you may use the takeSample method, the downside being that it returns an array (i.e. not an RDD), which must fit in your main memory:
val nu = sc.parallelize(1 to 10)
/** set seed for reproducibility */
val sp1 = nu.takeSample(true, 2, 182453)
sp1: Array[Int] = Array(7, 2)
val sp2 = nu.takeSample(true, 2)
sp2: Array[Int] = Array(2, 10)
val sp3 = nu.takeSample(true, 2)
sp3: Array[Int] = Array(4, 6)
Answer 2:
The fraction does not mean "give me exactly this number of elements". It means "give me this number of elements on average", so you will get a different number of elements each time you run it.
Answer 3:
The documentation for the sample method on RDD says: "Return a sampled subset of this RDD."
The size of the returned subset is not specified, so it could contain any number of elements from your original RDD.
Source: https://stackoverflow.com/questions/38198231/the-sample-method-of-spark-rdd-does-not-work-as-expected