Let's say I have a Spark DataFrame with the following schema:
root
 |-- prob: Double
 |-- word: String
I'd like to randomly select two rows from this DataFrame.
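For reference, such a DataFrame could be built like this (a minimal sketch; the SparkSession setup and the sample values are assumptions, not part of the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sampling").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data matching the (prob, word) schema above.
val df = Seq(
  (0.3, "hello"),
  (0.5, "world"),
  (0.2, "spark")
).toDF("prob", "word")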
1) You can use one of these DataFrame methods:

randomSplit(weights: Array[Double], seed: Long)
randomSplitAsList(weights: Array[Double], seed: Long)
sample(withReplacement: Boolean, fraction: Double)

and then take the first two Rows.
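For instance, a sketch assuming the DataFrame above is named df (note that sample keeps each row with probability fraction, so a too-small fraction can return fewer than two rows):

// Approximate: keep ~20% of rows, then collect the first two that survive.
val twoRows = df.sample(false, 0.2, 42L).take(2)

// Alternative: split randomly by weight and take two rows from the small split.
val Array(small, rest) = df.randomSplit(Array(0.1, 0.9), 42L)
val twoMore = small.take(2)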
2) Shuffle the rows and take the first two of them:
import org.apache.spark.sql.functions.rand
dataset.orderBy(rand()).limit(n)
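Note that orderBy(rand()) triggers a full sort of the dataset, so it is the most expensive of these options on large data; in exchange, unlike sample, it returns exactly n rows (as long as the dataset has at least n).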
3) Or you can use the takeSample method of the RDD and then convert the result back to a DataFrame:
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T]
For example (takeSample returns a local Array[Row], so it has to be parallelized back into an RDD before a DataFrame can be rebuilt; spark here is the SparkSession):

val rows = dataframe.rdd.takeSample(false, 2, 42L)
val sampled = spark.createDataFrame(spark.sparkContext.parallelize(rows), dataframe.schema)
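Keep in mind that takeSample is an action that collects the sampled rows into a local array on the driver, so it is only appropriate when num is small enough to fit in driver memory.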