Is there any better method than collect to read an RDD in Spark?

最后都变了 - Submitted on 2019-12-11 09:54:19

Question


So, I want to read an RDD into an array. For that purpose, I could use the collect method. But that method is really annoying, as in my case it keeps giving Kryo buffer overflow errors. If I set the Kryo buffer size too large, it starts to cause its own problems. On the other hand, I have noticed that if I just save the RDD to a file using the saveAsTextFile method, I get no errors. So, I was thinking, there must be some better method of reading an RDD into an array that isn't as problematic as the collect method.
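For reference, here is a minimal PySpark sketch of the two approaches described above; it assumes sc is an active SparkContext and the output path is hypothetical:

rdd = sc.parallelize(range(1000000))

# collect() pulls every element to the driver; with Kryo serialization
# enabled, a large result can overflow the Kryo buffer.
arr = rdd.collect()

# saveAsTextFile() writes each partition out in parallel, so the whole
# dataset never has to fit in driver memory.
rdd.saveAsTextFile("/tmp/rdd_output")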


Answer 1:


No. collect is the only method for reading an RDD into an array.

saveAsTextFile never has to collect all the data to one machine, so it is not limited by the available memory on a single machine in the same way that collect is.
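A sketch of the workaround this implies, assuming the RDD was saved to a local path (on HDFS you would fetch the part files with the HDFS client instead):

import glob

rdd.saveAsTextFile("/tmp/rdd_output")  # each executor writes its own partitions

lines = []
for part in sorted(glob.glob("/tmp/rdd_output/part-*")):
    with open(part) as f:
        # read the part files back on the driver, one file at a time
        lines.extend(line.rstrip("\n") for line in f)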




Answer 2:


toLocalIterator()

This method returns an iterator that contains all of the elements in this RDD. The iterator will consume only as much memory as the largest partition in the RDD. Internally it runs a separate Spark job to evaluate one partition at each step.

>>> x = rdd.toLocalIterator()
>>> x
<generator object toLocalIterator at 0x283cf00>

Then you can access the elements of the RDD with:

elements = []
for each_element in x:
    elements.append(each_element)  # pulls one element at a time through the local iterator

https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#toLocalIterator()
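If the goal is just to process each element on the driver rather than to build an array, iterating directly keeps memory bounded by a single partition. A small sketch, where process is a hypothetical per-element function:

for element in rdd.toLocalIterator():
    process(element)  # only one partition is resident in driver memory at a time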



Source: https://stackoverflow.com/questions/30333780/is-there-any-better-method-than-collect-to-read-an-rdd-in-spark
