how to divide rdd data into two in spark?


Question


I have data in a Spark RDD and I want to divide it into two parts with a scale such as 0.7. For example, if the RDD looks like this:

[1,2,3,4,5,6,7,8,9,10]

I want to divide it into rdd1:

 [1,2,3,4,5,6,7]

and rdd2:

[8,9,10]

with the scale 0.7. rdd1 and rdd2 should be random every time. I tried this:

import random

scale = 0.7
seed = random.randint(0, 10000)
rdd1 = data.sample(False, scale, seed)  # sample each element with probability ~0.7
rdd2 = data.subtract(rdd1)              # keep everything that was not sampled

It works sometimes, but when my data contains dicts I run into problems. For example, with data like this:

[{1: 2}, {3: 1}, {5: 4, 2: 6}]

I get

TypeError: unhashable type: 'dict'


Answer 1:


Both RDDs

rdd = sc.parallelize(range(10))
test, train = rdd.randomSplit(weights=[0.3, 0.7], seed=1)

test.collect()
## [4, 7, 8]

train.collect()
## [0, 1, 2, 3, 5, 6, 9]

and DataFrames

df = rdd.map(lambda x: (x, )).toDF(["x"])

test, train = df.randomSplit(weights=[0.3, 0.7])

provide the randomSplit method, which can be used here.
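
Unlike the sample/subtract approach, randomSplit only filters the elements and never needs to hash or compare them, so it also works for the dict records from the question. A minimal sketch, assuming the same SparkContext sc as above:

rdd_d = sc.parallelize([{1: 2}, {3: 1}, {5: 4, 2: 6}])
part_a, part_b = rdd_d.randomSplit(weights=[0.7, 0.3], seed=1)
part_a.collect()  # roughly 70% of the records
part_b.collect()  # the remaining records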

Notes:

  • randomSplit is implemented as a single filter for each output RDD, because in general it is not possible to yield multiple RDDs from a single Spark transformation. See https://stackoverflow.com/a/32971246/1560062 for details. A rough sketch of this pattern is shown after this list.

  • You cannot use subtract with dictionaries because internally it is expressed using cogroup, which requires the objects to be hashable. See also A list as a key for PySpark's reduceByKey. A hashable-key workaround is sketched below.
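
To illustrate the first note: the split can be written as one deterministic filter per output. This is only a rough sketch of the idea, not Spark's actual randomSplit implementation; it seeds the random draws per partition so that both filter passes make identical decisions (assuming a stable partitioning):

import random

def keep(index, iterator, seed=42, weight=0.7, left=True):
    # seed per partition so that both filter passes reproduce the same draws
    rng = random.Random(seed + index)
    for x in iterator:
        goes_left = rng.random() < weight
        if goes_left == left:
            yield x

rdd = sc.parallelize(range(10))
train = rdd.mapPartitionsWithIndex(lambda i, it: keep(i, it, left=True))
test = rdd.mapPartitionsWithIndex(lambda i, it: keep(i, it, left=False))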

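If a subtract-style operation on dicts is still needed, one hypothetical workaround (not part of the original answer) is to key every record by a hashable representation and use subtractByKey; this assumes the dict values are themselves hashable:

def as_key(d):
    # a frozenset of the items is hashable as long as the values are hashable
    return frozenset(d.items())

keyed = data.map(lambda d: (as_key(d), d))
keyed_rdd1 = rdd1.map(lambda d: (as_key(d), d))
rdd2 = keyed.subtractByKey(keyed_rdd1).values()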


Source: https://stackoverflow.com/questions/26943763/how-to-divide-rdd-data-into-two-in-spark
