SparkSQL DataFrame order by across partitions

Submitted by 假如想象 on 2019-12-10 14:24:51

Question


I'm using Spark SQL to run a query over my dataset. The result of the query is pretty small, but it is still partitioned.

I would like to coalesce the resulting DataFrame and order the rows by a column. I tried

DataFrame result = sparkSQLContext.sql("my sql").coalesce(1).orderBy("col1");
result.toJSON().saveAsTextFile("output");

I also tried

DataFrame result = sparkSQLContext.sql("my sql").repartition(1).orderBy("col1");
result.toJSON().saveAsTextFile("output");

The output file is ordered in chunks (i.e., each partition is ordered internally, but the data frame is not ordered as a whole). For example, instead of

1, value
2, value
4, value
4, value
5, value
5, value
...

I get

2, value
4, value
5, value
-----------> partition boundary
1, value
4, value
5, value
  1. What is the correct way to get an absolute ordering of my query result?
  2. Why isn't the data frame being coalesced into a single partition?
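The chunked output above is exactly what independent per-partition sorting produces. The difference between the two behaviors can be sketched in plain Python (a hypothetical simulation using the values from the example, not Spark code):

```python
# Each inner list stands for one Spark partition, using the
# values from the example output above.
partitions = [[2, 4, 5], [1, 4, 5]]

# Sorting each partition independently and then concatenating the
# output files gives the chunked order observed in the question.
chunked = [x for part in partitions for x in sorted(part)]
print(chunked)        # [2, 4, 5, 1, 4, 5] -- ordered only within chunks

# A global sort over all rows gives the order the question expects.
global_order = sorted(x for part in partitions for x in part)
print(global_order)   # [1, 2, 4, 4, 5, 5]
```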

Answer 1:


A few things worth mentioning here:

1. The source code shows that orderBy internally calls the sorting API with global ordering set to true, so a call to orderBy always requests a global order. The lack of ordering in the output therefore suggests the ordering was lost while writing to the target, not during the sort itself.

2. A drastic coalesce, such as forcing a single partition as in your case, can be really dangerous. I would recommend against it. The source code suggests that calling coalesce(1) can cause upstream transformations to run in a single partition, which would be brutal performance-wise.

3. You seem to expect the orderBy statement to be executed within a single partition. I don't agree with that expectation; it would make Spark a rather silly distributed framework.

Community, please let me know if you agree or disagree with these statements.

How are you collecting the data from the output, anyway?

Maybe the output actually contains sorted data, but the transformations/actions you performed to read from the output are responsible for the lost order.




Answer 2:


The orderBy will produce new partitions after your coalesce. To end up with a single output partition, reorder the operations:

DataFrame result = sparkSQLContext.sql("my sql").orderBy("col1").coalesce(1);
result.write().json("results.json");

As @JavaPlanet mentioned, for really big data you don't want to coalesce into a single partition. It will drastically reduce your level of parallelism.
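Why this operation order works: a global orderBy range-partitions the data, so each resulting partition holds a sorted, non-overlapping range of keys, and coalescing afterwards amounts to concatenating those ranges in order. A plain-Python sketch of that invariant (hypothetical simulation, not Spark code):

```python
# After a global sort, partitions hold disjoint, ordered key ranges
# (illustrated with the example's values).
range_partitions = [[1, 2], [4, 4], [5, 5]]

# coalesce(1) then just concatenates the partitions in order;
# the concatenation is already globally sorted.
merged = [x for part in range_partitions for x in part]
print(merged)  # [1, 2, 4, 4, 5, 5]
assert merged == sorted(merged)
```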



来源:https://stackoverflow.com/questions/31736519/sparksql-dataframe-order-by-across-partitions
