Is there any better method than collect to read an RDD in Spark?

最后都变了 - Submitted on 2019-12-11 09:54:19

Question


So, I want to read an RDD into an array. For that purpose, I could use the collect method. But that method is really annoying, as in my case it keeps giving Kryo buffer overflow errors. If I set the Kryo buffer size too large, it starts to cause its own problems. On the other hand, I have noticed that if I just save the RDD to a file using the saveAsTextFile method, I get no errors. So, I was thinking, there must be some better method of reading an RDD into an array that isn't as problematic as the collect method.
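For reference, here is a minimal PySpark sketch of the two approaches described above; it assumes sc is an active SparkContext and the output path is hypothetical:

rdd = sc.parallelize(range(1000000))

# collect() pulls every element to the driver; with Kryo serialization
# enabled, a large result can overflow the Kryo buffer.
arr = rdd.collect()

# saveAsTextFile() writes each partition out in parallel, so the whole
# dataset never has to fit in driver memory.
rdd.saveAsTextFile("/tmp/rdd_output")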


Answer 1:


No. collect is the only method for reading an RDD into an array.

saveAsTextFile never has to collect all the data to one machine, so it is not limited by the available memory on a single machine in the same way that collect is.
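A sketch of the workaround this implies, assuming the RDD was saved to a local path (on HDFS you would fetch the part files with the HDFS client instead):

import glob

rdd.saveAsTextFile("/tmp/rdd_output")  # each executor writes its own partitions

lines = []
for part in sorted(glob.glob("/tmp/rdd_output/part-*")):
    with open(part) as f:
        # read the part files back on the driver, one file at a time
        lines.extend(line.rstrip("\n") for line in f)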




Answer 2:


toLocalIterator()

This method returns an iterator that contains all of the elements in this RDD. The iterator will consume only as much memory as the largest partition in the RDD. Internally it runs a separate Spark job to evaluate one partition at each step.

>>> x = rdd.toLocalIterator()
>>> x
<generator object toLocalIterator at 0x283cf00>

Then you can access the elements of the RDD with:

elements = []
for each_element in x:
    elements.append(each_element)  # pulls one element at a time through the local iterator

https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#toLocalIterator()
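If the goal is just to process each element on the driver rather than to build an array, iterating directly keeps memory bounded by a single partition. A small sketch, where process is a hypothetical per-element function:

for element in rdd.toLocalIterator():
    process(element)  # only one partition is resident in driver memory at a time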



Source: https://stackoverflow.com/questions/30333780/is-there-any-better-method-than-collect-to-read-an-rdd-in-spark
