View RDD contents in Python Spark?

醉酒成梦 2020-11-29 03:40

I'm running a simple app in PySpark:

from operator import add

f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)

6 Answers

遥遥无期 (original poster)

2020-11-29 04:19
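If you want to see the contents of an RDD, then yes, `collect()` is one option, but it fetches all the data to the driver, so it can be a problem when the RDD is large. If you just want to see a sample, it is better to use
```
.take(n)
```
which returns only the first `n` elements to the driver.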
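
As a rough sketch of the `take` approach: the helper name `preview` and the `demo` function below are my own illustration, not part of any Spark API. `demo` assumes `pyspark` is installed and that a local master is fine for trying it out.

```python
def preview(rdd, n=5):
    """Print at most n elements of rdd on the driver and return them."""
    # take(n) ships only the first n elements to the driver,
    # unlike collect(), which ships the whole RDD.
    rows = rdd.take(n)
    for row in rows:
        print(row)
    return rows


def demo():
    # Hypothetical demo: builds a tiny word count like the one in the question.
    from operator import add
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "preview-demo")
    try:
        wc = (sc.parallelize(["a b a", "b c"])
                .flatMap(lambda x: x.split(" "))
                .map(lambda x: (x, 1))
                .reduceByKey(add))
        preview(wc, 2)
    finally:
        sc.stop()
```

In the question's session you would just call `preview(wc, 10)`. Which elements come back depends on how the data is partitioned, so don't rely on the order.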

Running `foreach` and trying to print: I don't recommend this, because if you are running on a cluster, the print output stays local to each executor, and each executor prints only the data it has access to. The print statement doesn't change any state, so it isn't logically wrong. To get all the output in one place, you would have to do something like
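```
# Bring every element to the driver, then print it there.
# Only safe when the whole RDD fits in driver memory.
for record in wc.collect():
    print(record)
```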
But this may cause the job to fail, because collecting all the data on the driver can crash it. I would suggest using `take`, or, if you want to analyze the data, either using `sample` and collecting the result on the driver, or writing the RDD to a file and analyzing that.
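
A minimal sketch of those last two options. The helper names `sample_to_driver` and `dump_to_files`, and the default fraction and seed, are made up for illustration. The only Spark calls used are `RDD.sample(withReplacement, fraction, seed)`, `RDD.collect()` and `RDD.saveAsTextFile(path)`.

```python
def sample_to_driver(rdd, fraction=0.01, seed=42):
    """Collect a random subset of rdd on the driver.

    fraction is the expected share of elements kept, not an exact count,
    so the result size varies from run to run unless the seed is fixed.
    """
    # False = sample without replacement
    return rdd.sample(False, fraction, seed).collect()


def dump_to_files(rdd, path):
    """Write rdd as text files under path, without collecting anything."""
    # Each partition is written by its executor as a separate part file.
    # Spark raises an error if the output path already exists.
    rdd.saveAsTextFile(path)
```

For the question's RDD this would be something like `sample_to_driver(wc, 0.1)` for a quick look, or `dump_to_files(wc, "wc_output")` (a hypothetical output directory) when the data is too big to bring back to the driver at all.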