Running a simple word-count app in PySpark:
from operator import add

f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
Or try it step by step:
data = f.flatMap(lambda x: x.split(' '))        # split each line into words
pairs = data.map(lambda x: (x, 1))              # pair each word with a count of 1
counts = pairs.reduceByKey(lambda x, y: x + y)  # sum the counts per word
result = counts.collect()                       # bring all results to the driver
Please note that when you run collect(), the RDD, which is a distributed data set, is aggregated at the driver node and essentially converted to a plain Python list. So it is obviously a bad idea to collect() a 2 TB data set. If all you need is a few samples from your RDD, use take(10) instead.
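For instance, here is a minimal sketch of sampling a handful of results rather than collecting everything, assuming the counts RDD from the steps above:

# take(10) pulls only the first 10 (word, count) pairs to the driver;
# the rest of the RDD stays distributed across the cluster.
sample = counts.take(10)
for word, count in sample:
    print(word, count)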