Let's say I have a Spark DataFrame with the following schema:
root
 |-- prob: Double
 |-- word: String
I'd like to randomly select two rows from this DataFrame.
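For reference, such a DataFrame could be built like this (a minimal sketch; the SparkSession setup and the sample values are assumptions, not part of the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sampling").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data matching the (prob, word) schema above.
val df = Seq(
  (0.3, "hello"),
  (0.5, "world"),
  (0.2, "spark")
).toDF("prob", "word")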
1) You can use one of these DataFrame methods:

randomSplit(weights: Array[Double], seed: Long)
randomSplitAsList(weights: Array[Double], seed: Long)
sample(withReplacement: Boolean, fraction: Double)

and then take the first two Rows.
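For instance, a sketch assuming the DataFrame above is named df (note that sample keeps each row with probability fraction, so a too-small fraction can return fewer than two rows):

// Approximate: keep ~20% of rows, then collect the first two that survive.
val twoRows = df.sample(false, 0.2, 42L).take(2)

// Alternative: split randomly by weight and take two rows from the small split.
val Array(small, rest) = df.randomSplit(Array(0.1, 0.9), 42L)
val twoMore = small.take(2)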
2) Shuffle the rows and take the first two of them:
import org.apache.spark.sql.functions.rand
dataset.orderBy(rand()).limit(n)
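Note that orderBy(rand()) triggers a full sort of the dataset, so it is the most expensive of these options on large data; in exchange, unlike sample, it returns exactly n rows (as long as the dataset has at least n).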
3) Or you can use the takeSample method of the RDD and then convert the result back to a DataFrame:
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T]
For example (takeSample returns a local Array[Row], so it has to be parallelized back into an RDD before a DataFrame can be rebuilt; spark here is the SparkSession):

val rows = dataframe.rdd.takeSample(false, 2, 42L)
val sampled = spark.createDataFrame(spark.sparkContext.parallelize(rows), dataframe.schema)
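Keep in mind that takeSample is an action that collects the sampled rows into a local array on the driver, so it is only appropriate when num is small enough to fit in driver memory.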