Can Dataframe joins in Spark preserve order?

Submitted by 余生颓废 on 2021-02-08 13:50:43

Question


I'm currently trying to join two DataFrames together but retain the same order in one of the Dataframes.

From Which operations preserve RDD order?, it seems (correct me if this is inaccurate, as I'm new to Spark) that joins do not preserve order: because the data lives in different partitions, rows "arrive" at the final dataframe in no specified order.

How could one perform a join of two DataFrames while preserving the order of one table?

E.g.,

+------+------+
| col1 | col2 |
+------+------+
| 0    | a    |
| 1    | b    |
+------+------+

joined with

+------+------+
| col2 | col3 |
+------+------+
| b    | x    |
| a    | y    |
+------+------+

on col2 should give

+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 0    | a    | y    |
| 1    | b    | x    |
+------+------+------+

I've heard some things about using coalesce or repartition, but I'm not sure. Any suggestions/methods/insights are appreciated.

Edit: would this be analogous to having a single reducer in MapReduce? If so, what would that look like in Spark?


Answer 1:


It can't. A join involves a shuffle that redistributes rows across partitions, so no row order is guaranteed. You can add a monotonically_increasing_id column before the join and reorder the data on it afterwards.



Source: https://stackoverflow.com/questions/38085801/can-dataframe-joins-in-spark-preserve-order
