Question

I am exploring the Spark 2.0 Java API and have a question about the collect() and collectAsync() methods available on JavaRDD.

Answer 1:

collect() is an action that brings the contents of an RDD back to the driver, typically so you can inspect them. It is synchronous: the call blocks until the job finishes. collectAsync() is asynchronous: it returns immediately with a future for retrieving all elements of the RDD, so the calling thread can submit other jobs that run in parallel. When several jobs run concurrently, you can use the fair scheduler so that they share cluster resources.
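As a sketch of the fair-scheduler suggestion: Spark schedules jobs inside one application in FIFO order by default, and the mode is controlled by the spark.scheduler.mode property. The line below assumes it goes into conf/spark-defaults.conf; the same property can also be set on the SparkConf before the context is created.

```
spark.scheduler.mode  FAIR
```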

Answer 2:

collect():

It returns all of the elements in this RDD. In the Java API the result is a java.util.List (the Scala API returns an array).

// Assumes sc is an existing JavaSparkContext.
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = sc.parallelize(data, 1);
List<Integer> result = rdd.collect();
// The elements are copied to the driver in the step above; the calling
// thread blocks until the action completes.

collectAsync():

The asynchronous version of collect(). It returns a JavaFutureAction, which extends java.util.concurrent.Future, for retrieving a list containing all of the elements in this RDD.

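// Assumes sc is an existing JavaSparkContext.
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = sc.parallelize(data, 1);
JavaFutureAction<List<Integer>> future = rdd.collectAsync();
// Returns only a future, not the data; the call does not wait for the job to finish.

List<Integer> result = future.get();
// get() blocks until the job has finished, then returns the elements collected on the driver.
// It throws the checked InterruptedException and ExecutionException.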

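The difference is only in how the result is delivered. With collect(), the calling thread waits until the action completes (synchronous). With collectAsync(), the thread receives a future right away and moves on to the next instruction (asynchronous). In both cases all of the elements end up on the driver.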
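The blocking versus future-returning pattern can be tried without a cluster. The following is a plain-JDK analogy, not Spark code: the class CollectStyles and its methods slowCollect, collectBlocking and collectNonBlocking are hypothetical names made up for this sketch, and CompletableFuture stands in for Spark's JavaFutureAction.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

public class CollectStyles {
    // Hypothetical stand-in for a Spark job: takes a while, then returns the data.
    public static List<Integer> slowCollect() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return Arrays.asList(1, 2, 3, 4, 5);
    }

    // Synchronous style, like collect(): the caller waits for the result.
    public static List<Integer> collectBlocking() {
        return slowCollect();
    }

    // Asynchronous style, like collectAsync(): the work starts on another
    // thread and the caller immediately receives a future.
    public static CompletableFuture<List<Integer>> collectNonBlocking() {
        return CompletableFuture.supplyAsync(CollectStyles::slowCollect);
    }

    public static void main(String[] args) throws Exception {
        List<Integer> sync = collectBlocking();
        System.out.println("sync result: " + sync);

        Future<List<Integer>> future = collectNonBlocking();
        System.out.println("done right after submit? " + future.isDone());
        // Other driver-side work could run here while the task is in flight.
        List<Integer> async = future.get(); // blocks only here, until the task finishes
        System.out.println("async result: " + async);
    }
}
```

Both paths return the same list; what changes is when the calling thread has to wait. With a future, the wait is deferred until get() is called, which leaves room to do other work or submit further jobs in between.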

Source: https://stackoverflow.com/questions/41333190/what-is-the-difference-between-spark-javardd-methods-collect-collectasync

Tags

java

apache-spark

rdd

What is the difference between the Spark JavaRDD methods collect() and collectAsync()?
