Question
I am exploring the Spark 2.0 Java API and have a question about the collect() and collectAsync() methods available on JavaRDD.
Answer 1:
collect() is an action that brings the entire contents of an RDD back to the driver, typically so you can inspect it. It is synchronous: the calling thread blocks until the job finishes. collectAsync() is asynchronous: it returns a future for retrieving all elements of the RDD, so the calling thread can go on to submit other jobs that run in parallel. To share resources more evenly among such concurrent jobs, you can use the fair scheduler for job scheduling.
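The blocking-versus-future distinction can be tried out without a cluster. The sketch below is only an analogy built on java.util.concurrent, not Spark code: runJob, collectLike and collectAsyncLike are hypothetical stand-ins, and CompletableFuture plays the role that JavaFutureAction plays in Spark.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class CollectAnalogy {

    // Hypothetical stand-in for a Spark job: sleeps briefly, then returns the data.
    static List<Integer> runJob(List<Integer> data) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return data;
    }

    // Behaves like collect(): the caller blocks until the result is ready.
    static List<Integer> collectLike(List<Integer> data) {
        return runJob(data);
    }

    // Behaves like collectAsync(): the work starts in the background and the
    // caller immediately receives a future.
    static CompletableFuture<List<Integer>> collectAsyncLike(List<Integer> data) {
        return CompletableFuture.supplyAsync(() -> runJob(data));
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 2, 3);
        List<Integer> b = Arrays.asList(4, 5);

        // Synchronous: the second call cannot start until the first returns.
        List<Integer> syncA = collectLike(a);
        List<Integer> syncB = collectLike(b);

        // Asynchronous: both jobs are submitted before either result is
        // awaited, so they are free to overlap.
        CompletableFuture<List<Integer>> futureA = collectAsyncLike(a);
        CompletableFuture<List<Integer>> futureB = collectAsyncLike(b);
        List<Integer> asyncA = futureA.join(); // blocks until job A finishes
        List<Integer> asyncB = futureB.join(); // blocks until job B finishes

        System.out.println(syncA.equals(asyncA) && syncB.equals(asyncB)); // prints "true"
    }
}
```

Both styles produce the same data; what changes is when the calling thread has to wait.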
Answer 2:
collect():
In the Java API, it returns a List (java.util.List) that contains all of the elements in this RDD.
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = sc.parallelize(data, 1);
List<Integer> result = rdd.collect();
// The elements are copied to the driver in the step above, and the calling
// thread waits until the action completes.
collectAsync():
The asynchronous version of collect(), which returns a JavaFutureAction (an extension of java.util.concurrent.Future) for retrieving a List containing all of the elements in this RDD.
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = sc.parallelize(data, 1);
JavaFutureAction<List<Integer>> future = rdd.collectAsync();
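// Returns only a future object, not the data; the call does not wait for the job.
List<Integer> result = future.get();
// get() blocks until the job completes, then returns the elements copied to the driver.
// Note that get() can throw InterruptedException or ExecutionException.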
The only difference is in how we receive the data: synchronously (the thread waits until the action completes in collect()) or asynchronously (the thread gets a Future object and moves on to the next instruction).
Source: https://stackoverflow.com/questions/41333190/what-is-the-difference-between-spark-javardd-methods-collect-collectasync