How to divide RDD data into two in Spark?

zero323

Both RDDs and DataFrames provide a randomSplit method that can be used here. For RDDs:

rdd = sc.parallelize(range(10))
test, train = rdd.randomSplit(weights=[0.3, 0.7], seed=1)

test.collect()
## [4, 7, 8]

train.collect()
## [0, 1, 2, 3, 5, 6, 9]
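
randomSplit is not limited to two parts. A minimal sketch, splitting the same rdd three ways into hypothetical train/validation/test sets:

train, validation, test = rdd.randomSplit(weights=[0.6, 0.2, 0.2], seed=1)

# Each element lands in exactly one split, so the counts always sum to the original size.
train.count() + validation.count() + test.count()
## 10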

and for DataFrames:

df = rdd.map(lambda x: (x, )).toDF(["x"])

test, train = df.randomSplit(weights=[0.3, 0.7])

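The DataFrame variant accepts an optional seed as well, which makes the split reproducible across runs; a minimal sketch using the df defined above:

test, train = df.randomSplit(weights=[0.3, 0.7], seed=1)

# Each row is assigned independently at random, so on 10 rows the
# sizes only approximate the requested 3/7 ratio.
test.count(), train.count()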

Notes:

randomSplit normalizes the weights if they don't sum to 1, and because each element is assigned to a split independently at random, the resulting sizes only approximate the requested proportions.
