Comparing intersection between two nodes using broadcast variable and using RDD.filter in Spark GraphX

梦想的初衷 submitted on 2020-03-04 05:03:11

Question


I work on graphs in GraphX. Using the code below, I store the neighbors of every node in an RDD:

val all_neighbors: VertexRDD[Array[VertexId]] = graph.collectNeighborIds(EdgeDirection.Either)

I then broadcast the neighbors to all worker nodes using the code below:

val broadcastVar = all_neighbors.collect().toMap
val nvalues = sc.broadcast(broadcastVar)

I want to compute the intersection of the neighbors of two nodes, for example the neighbors of node 1 and the neighbors of node 2.

First, I computed the intersection using the broadcast variable nvalues:

val common_neighbors=nvalues.value(1).intersect(nvalues.value(2))

Then I computed the intersection for the same two nodes by filtering the RDD:

val common_neighbors2=(all_neighbors.filter(x=>x._1==1)).intersection(all_neighbors.filter(x=>x._1==2))

My question is: which of the two methods above is more efficient, and which is more distributed and parallel? Using the broadcast variable nvalues to compute the intersection, or filtering the RDD?


Answer 1:


I think it depends on the situation.

If nvalues is small enough to fit in memory on the driver and on each executor, broadcasting is the better approach: the data is shipped to each executor once and cached there, so it is not recomputed or re-sent for every task. This saves Spark a lot of communication and computation. The filter-based approach is less suitable in this case because, unless the all_neighbors RDD is cached, it may be recomputed every time it is used, which adds computation cost and degrades performance.
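To make the broadcast case concrete, here is a minimal sketch, assuming a local SparkContext and a small made-up graph (the object name, the toy edges, and the list of vertex pairs are all hypothetical, not from the question). One point worth noting: a lookup such as `nvalues.value(1).intersect(nvalues.value(2))` written directly in driver code runs locally on the driver, so the broadcast only pays off when `nvalues.value` is read inside a task, as in the `map` below.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeDirection, Graph, VertexId}

object BroadcastNeighborsSketch {
  // For each (hypothetical) vertex pair, count the neighbors the two vertices share.
  def commonNeighborCounts(sc: SparkContext): Map[(VertexId, VertexId), Int] = {
    // Toy graph (assumption): 1-3, 1-4, 2-3, 2-5
    val edges = sc.parallelize(Seq(
      Edge(1L, 3L, 0), Edge(1L, 4L, 0), Edge(2L, 3L, 0), Edge(2L, 5L, 0)))
    val graph = Graph.fromEdges(edges, 0)

    // Collect the neighbor lists to the driver and broadcast them once.
    // This only works when the whole map fits in driver and executor memory.
    val neighborMap = graph.collectNeighborIds(EdgeDirection.Either).collect().toMap
    val nvalues = sc.broadcast(neighborMap)

    // Many pairs are processed in parallel; each task reads the executor-local copy.
    val pairs = sc.parallelize(Seq((1L, 2L), (1L, 3L)))
    pairs.map { case (a, b) =>
      val na = nvalues.value.getOrElse(a, Array.empty[VertexId])
      val nb = nvalues.value.getOrElse(b, Array.empty[VertexId])
      ((a, b), na.intersect(nb).length)
    }.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-neighbors").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try commonNeighborCounts(sc).foreach(println)
    finally sc.stop()
  }
}
```

For a single pair of nodes, the driver-side lookup is already about as cheap as it gets; the broadcast version is mainly useful when many pairs have to be compared.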

If nvalues cannot fit in memory on the driver and on each executor, broadcasting will not work, because it will fail with an error. In that case the only option left is the second approach. It may still have performance issues, but at least the code will work!
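If you go with the RDD-only approach, a sketch along the following lines may help. It assumes the same kind of toy graph as a stand-in for your data, and the helper name `neighborsOf` is hypothetical. Two caveats: the second snippet in the question intersects RDDs of `(VertexId, Array[VertexId])` pairs, and since the keys differ (1 vs. 2) that intersection is always empty, so the neighbor arrays need to be flattened into vertex ids first; and calling `cache()` on the neighbor RDD avoids recomputing it for each filter.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeDirection, Graph, VertexId}
import org.apache.spark.rdd.RDD

object RddNeighborsSketch {
  // Common neighbors of vertices a and b, computed without collecting to the driver first.
  def commonNeighbors(sc: SparkContext, a: VertexId, b: VertexId): Array[VertexId] = {
    // Toy graph (assumption): 1-3, 1-4, 2-3, 2-5
    val edges = sc.parallelize(Seq(
      Edge(1L, 3L, 0), Edge(1L, 4L, 0), Edge(2L, 3L, 0), Edge(2L, 5L, 0)))
    val graph = Graph.fromEdges(edges, 0)

    // Cache so the neighbor lists are not rebuilt for every filter below.
    val allNeighbors = graph.collectNeighborIds(EdgeDirection.Either).cache()

    // Flatten the neighbor array of one vertex into an RDD of vertex ids.
    def neighborsOf(id: VertexId): RDD[VertexId] =
      allNeighbors
        .filter { case (vid, _) => vid == id }
        .flatMap { case (_, nbrs) => nbrs.toSeq }

    // RDD.intersection now compares vertex ids, not (id, array) pairs.
    neighborsOf(a).intersection(neighborsOf(b)).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-neighbors").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(commonNeighbors(sc, 1L, 2L).mkString(", "))
    finally sc.stop()
  }
}
```

Keep in mind that `intersection` involves a shuffle, so for one pair of nodes this is likely to be slower than a local array intersection; its advantage is only that nothing has to fit on a single machine.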

Let me know if it helps!!



Source: https://stackoverflow.com/questions/60493554/comparing-intersection-between-two-nodes-using-broadcast-variable-and-using-rdd