Question
How can I get a random row from a PySpark DataFrame? I only see the method sample(),
which takes a fraction as a parameter. Setting this fraction to 1/numberOfRows
gives random results: sometimes I don't get any row at all.
On an RDD there is a method takeSample()
that takes as a parameter the number of elements you want the sample to contain. I understand that this might be slow, as you have to count each partition, but is there a way to get something like this on a DataFrame?
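For example, a minimal sketch of the behavior I mean (the SparkSession named spark here is just illustrative):
df = spark.range(100)            # a DataFrame with 100 rows
frac = 1.0 / df.count()          # fraction = 1/numberOfRows
df.sample(False, frac).count()   # varies between runs; can be 0
df.rdd.takeSample(False, 1)      # always returns exactly one element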
Answer 1:
You can simply call takeSample on an RDD:
df = sqlContext.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
# takeSample(withReplacement, num, seed) collects exactly `num`
# rows to the driver as a list of Row objects
df.rdd.takeSample(False, 1, seed=0)
## [Row(k=3, v='c')]
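Note that takeSample collects the result to the driver and counts the RDD first, which is exactly the cost the question mentions. On Spark 2.x and later the same pattern works through a SparkSession; a minimal sketch:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
df.rdd.takeSample(False, 1, seed=0)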
If you don't want to collect, you can simply take a higher fraction and limit:
df.sample(False, 0.1, seed=0).limit(1)   # still a lazy DataFrame, not collected
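As the question points out, a small fraction can still produce an empty sample on a small DataFrame, so size the fraction to your data. A hedged sketch of one way to guard against that, relying on the fact that head() returns None on an empty DataFrame:
row = df.sample(False, 0.5, seed=0).limit(1).head()
if row is None:
    # the sample happened to be empty; fall back to a fraction of 1.0,
    # which keeps every row, so limit(1) is guaranteed a result
    row = df.sample(False, 1.0).limit(1).head()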
Source: https://stackoverflow.com/questions/34003314/how-take-a-random-row-from-a-pyspark-dataframe