Prepare my bigdata with Spark via Python

Submitted by 折月煮酒 on 2019-11-29 17:01:39
malisit

You can use a few basic PySpark transformations to achieve this.

>>> rdd = sc.parallelize([(1424411938, [3885, 7898]),(3333333333, [3885, 7898])])
>>> r = rdd.flatMap(lambda x: ((a,x[0]) for a in x[1]))

We use flatMap to emit one (key, value) pair for every item in x[1]: each record is turned into a series of (a, x[0]) pairs, where a is an item of x[1]. To understand flatMap better, see the documentation.
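To see what this step produces without a Spark cluster, here is a minimal pure-Python sketch. The helper `flat_map` is a hypothetical local stand-in written for illustration, not part of PySpark; it only mimics the "apply, then flatten one level" behaviour on an ordinary list.

```python
from itertools import chain

# Same sample records as above: (timestamp, [ids])
records = [(1424411938, [3885, 7898]), (3333333333, [3885, 7898])]

def flat_map(func, items):
    """Hypothetical local stand-in for RDD.flatMap:
    apply func to each item, then flatten the results one level."""
    return list(chain.from_iterable(func(x) for x in items))

# Each record yields one (id, timestamp) pair per id in its list.
pairs = flat_map(lambda x: ((a, x[0]) for a in x[1]), records)
print(pairs)
# [(3885, 1424411938), (7898, 1424411938), (3885, 3333333333), (7898, 3333333333)]
```

A plain map would instead return one generator per record; the flattening is what gives a single flat sequence of pairs.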

>>> r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])))

This groups all (key, value) pairs by key and uses the tuple function to convert each group's iterable of values into a tuple.

>>> r2.collect()
[(3885, (1424411938, 3333333333)), (7898, (1424411938, 3333333333))]
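The grouping step can be sketched locally in the same spirit. `group_by_key` below is a hypothetical helper for illustration only; it runs on a plain list and keeps insertion order, whereas Spark does not guarantee the order of keys or of the values within a group.

```python
from collections import defaultdict

# The (id, timestamp) pairs produced by the flatMap step.
pairs = [(3885, 1424411938), (7898, 1424411938),
         (3885, 3333333333), (7898, 3333333333)]

def group_by_key(pairs):
    """Hypothetical local stand-in for groupByKey followed by tuple():
    collect all values that share a key, then freeze each group as a tuple."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [(key, tuple(values)) for key, values in groups.items()]

grouped = group_by_key(pairs)
print(grouped)
# [(3885, (1424411938, 3333333333)), (7898, (1424411938, 3333333333))]
```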

As you said, you can use [:150] to keep the first 150 elements. I think this would be the proper usage:

>>> r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])[:150]))
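Note that tuple(x[1])[:150] builds the full tuple for a key before slicing it. As a possible variation, itertools.islice stops after the first 150 items instead. The sketch below shows the idea on a plain Python iterator; `first_n` is a hypothetical helper, and using it inside the Spark lambda is an assumption that has not been tested here.

```python
from itertools import islice

def first_n(values, n=150):
    """Hypothetical helper: take at most n items from any iterable
    without materialising the rest of it."""
    return tuple(islice(values, n))

# Plain-Python demonstration on an iterator of 1000 values.
sample = iter(range(1000))
capped = first_n(sample)
print(len(capped))  # 150

# Assumed usage inside the Spark pipeline (untested sketch):
# r2 = r.groupByKey().map(lambda x: (x[0], first_n(x[1])))
```

Either way, groupByKey still shuffles every value for a key across the cluster; the cap only limits what is kept afterwards.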

I tried to be as explanatory as possible. I hope this helps.
