Ordering of rows in JavaRdds after union

Submitted by 不羁的心 on 2021-01-28 08:08:45

Question


I am trying to find information on the ordering of rows in an RDD. Here is what I am trying to do:

Rdd1, Rdd2
Rdd3 = Rdd1.union(Rdd2);

In Rdd3, is there any guarantee that Rdd1's records will appear first and Rdd2's afterwards? In my tests I saw this behavior, but I wasn't able to find it in any docs.

Just FYI, I do not care about the ordering within each RDD (i.e. the order of the data inside Rdd1 or Rdd2 is not a concern). The requirement is only that, after the union, Rdd1's records come first.


Answer 1:


In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered. See http://spark.apache.org/docs/latest/programming-guide.html#background

If you check your RDD3, you should find that it is just all the partitions of RDD1 followed by all the partitions of RDD2, so in this case the results happen to be ordered the way you want. Simply concatenating the partitions of the two RDDs is Spark's standard behaviour for union, as discussed in the question "In Apache Spark, why does RDD.union not preserve the partitioner?"
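
The concatenation described above can be pictured with a small plain-Java model. This is only a sketch and not Spark code: an RDD is modeled as a list of partitions, and the names UnionModel, union and collect are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnionModel {
    // Toy model: an "RDD" is an ordered list of partitions,
    // and each partition is a list of elements.
    public static <T> List<List<T>> union(List<List<T>> first, List<List<T>> second) {
        List<List<T>> result = new ArrayList<>(first); // all partitions of the first RDD
        result.addAll(second);                         // followed by all partitions of the second
        return result;
    }

    // Flatten the partitions in partition order.
    public static <T> List<T> collect(List<List<T>> partitions) {
        List<T> out = new ArrayList<>();
        for (List<T> partition : partitions) {
            out.addAll(partition);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> rdd1 = Arrays.asList(Arrays.asList("a1", "a2"), Arrays.asList("a3"));
        List<List<String>> rdd2 = Arrays.asList(Arrays.asList("b1"), Arrays.asList("b2", "b3"));

        List<List<String>> rdd3 = union(rdd1, rdd2);
        System.out.println(rdd3.size());   // 4 partitions: 2 from rdd1 + 2 from rdd2
        System.out.println(collect(rdd3)); // [a1, a2, a3, b1, b2, b3]
    }
}
```

In this model, reordering elements inside any single partition never moves an rdd1 element behind an rdd2 element, which matches the behaviour observed in the question.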

So in this case, it appears that union will give you what you want. However, this behaviour is an implementation detail of union and not part of its interface definition, so you cannot rely on it staying the same in future versions.
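
If the requirement must hold regardless of how union is implemented, one option is to make the order explicit instead of relying on partition layout. The following is a minimal sketch of that idea, not something from the original answer: it tags each record with 0 or 1 depending on its source RDD and sorts by that tag. The class name OrderedUnion, the helper orderedUnion and the sample data are assumptions for illustration, and the sort triggers a shuffle, so it costs more than a plain union.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class OrderedUnion {
    // Hypothetical helper: returns the records of `first` followed by the records of `second`.
    // The order of records within each source RDD is not preserved by the sort.
    public static <T> JavaRDD<T> orderedUnion(JavaRDD<T> first, JavaRDD<T> second) {
        JavaPairRDD<Integer, T> taggedFirst = first.mapToPair(x -> new Tuple2<>(0, x));
        JavaPairRDD<Integer, T> taggedSecond = second.mapToPair(x -> new Tuple2<>(1, x));
        // Sort on the source tag so that every key-0 record precedes every key-1 record.
        return taggedFirst.union(taggedSecond).sortByKey().values();
    }

    public static void main(String[] args) {
        // Local master and sample data are assumptions for this sketch.
        SparkConf conf = new SparkConf().setAppName("OrderedUnion").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("a1", "a2", "a3"));
            JavaRDD<String> rdd2 = sc.parallelize(Arrays.asList("b1", "b2", "b3"));

            List<String> result = orderedUnion(rdd1, rdd2).collect();
            // rdd1's records come first, then rdd2's; the order inside each group is unspecified.
            System.out.println(result);
        }
    }
}
```

Since the question does not care about the order inside each RDD, sorting on the tag alone is enough here. If that inner order mattered as well, a secondary key (for example an index from zipWithIndex) would be needed.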



Source: https://stackoverflow.com/questions/31820230/ordering-of-rows-in-javardds-after-union
