What is the difference between the Spark JavaRDD methods collect() & collectAsync()?

Submitted by 不问归期 on 2019-12-22 12:21:34

Question


I am exploring the Spark 2.0 Java API and have a question about the collect() and collectAsync() methods available on JavaRDD.


Answer 1:


The collect() action is mainly used to view the contents of an RDD, and it is synchronous. collectAsync() is asynchronous: it returns a future for retrieving all elements of the RDD, so other jobs can run in parallel while it is in progress. To share cluster resources better between such concurrent jobs, you can use the fair scheduler (spark.scheduler.mode set to FAIR).
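To make the "other jobs can run in parallel" point concrete, here is a minimal sketch that uses only the JDK, not Spark. The class ParallelJobsSketch and the method fakeJob are hypothetical stand-ins for two collectAsync() calls; the barrier exists only to prove that both jobs are in flight at the same time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelJobsSketch {

    // Hypothetical stand-in for a Spark job: it waits at the barrier until the
    // other job is running too, then returns its "collected" data.
    static List<Integer> fakeJob(CyclicBarrier barrier, List<Integer> data) {
        try {
            barrier.await(5, TimeUnit.SECONDS); // times out if the jobs do not overlap
        } catch (Exception e) {
            throw new IllegalStateException("jobs did not run concurrently", e);
        }
        return data;
    }

    // Submits both jobs first and only then waits for the two results.
    static List<List<Integer>> runBoth() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CyclicBarrier barrier = new CyclicBarrier(2);
        try {
            CompletableFuture<List<Integer>> first =
                CompletableFuture.supplyAsync(() -> fakeJob(barrier, Arrays.asList(1, 2, 3)), pool);
            CompletableFuture<List<Integer>> second =
                CompletableFuture.supplyAsync(() -> fakeJob(barrier, Arrays.asList(4, 5)), pool);
            // The calling thread blocks here, once per result.
            return Arrays.asList(first.get(), second.get());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runBoth());
    }
}
```

Had the first job been awaited before the second was submitted, as happens with two back-to-back collect() calls, the barrier would time out. With real RDDs the same shape applies: call collectAsync() on each RDD, then call get() on each returned future.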




Answer 2:


collect():

It returns a list (java.util.List in the Java API) that contains all of the elements in this RDD.

// sc is an existing JavaSparkContext
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = sc.parallelize(data, 1);
List<Integer> result = rdd.collect();
// The elements are copied to the driver in the step above; the calling
// thread blocks until the action completes.

collectAsync():

The asynchronous version of collect(). It returns a JavaFutureAction, which extends java.util.concurrent.Future, for retrieving a list containing all of the elements in this RDD.

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = sc.parallelize(data, 1);
JavaFutureAction<List<Integer>> future = rdd.collectAsync();
// Returns immediately with a future, not the data; the job runs in the background.

List<Integer> result = future.get();
// get() blocks until the job finishes, then returns the elements on the driver.
// It can throw InterruptedException or ExecutionException.

The only difference is in how the result is delivered. With collect() the call is synchronous: the calling thread waits until the action completes. With collectAsync() it is asynchronous: the thread receives a Future immediately and moves on to the next instruction.
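The Future behaviour described here can be observed without a cluster. The following is a rough sketch using only java.util.concurrent; SyncVsAsyncSketch, collectNonBlocking and the latch are hypothetical names introduced for illustration and are not part of the Spark API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SyncVsAsyncSketch {

    // Non-blocking style, like collectAsync(): the caller gets a Future at once.
    // The latch is a hypothetical device that holds the job back so the demo
    // behaves the same way on every run.
    static Future<List<Integer>> collectNonBlocking(
            ExecutorService pool, CountDownLatch gate, List<Integer> data) {
        return pool.submit(() -> {
            gate.await(); // the "job" cannot finish until the gate opens
            return data;
        });
    }

    // Returns true if the future was still pending right after submission
    // and delivered the full data once get() was called.
    static boolean demo() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch gate = new CountDownLatch(1);
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        try {
            Future<List<Integer>> future = collectNonBlocking(pool, gate, data);
            boolean pendingAfterSubmit = !future.isDone(); // caller moved on, no data yet
            gate.countDown();                              // let the job finish
            List<Integer> result = future.get();           // blocks until the result exists
            return pendingAfterSubmit && result.equals(data);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

Between the submission and the call to get(), the caller is free to do other work, which is the practical benefit of the asynchronous variant.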



Source: https://stackoverflow.com/questions/41333190/what-is-the-difference-between-spark-javardd-methods-collect-collectasync
