Pre-partition data in Spark such that each partition has non-overlapping values in the column we are partitioning on

Submitted by 99封情书 on 2021-02-11 15:01:23

Question


I'm trying to pre-partition the data before running an aggregation over a certain column of my data. I have 3 worker nodes, and I would like each partition to have non-overlapping values in the column I am partitioning on. I don't want two partitions to ever contain the same value in that column.

e.g. If I have the following data

ss_item_sk | ss_quantity
1          | 10.0
1          |  4.0
2          |  3.0
3          |  5.0
4          |  8.0
5          |  13.0
5          |  10.0

Then the following partitions are satisfactory:

partition 1

ss_item_sk | ss_quantity
1          | 10.0
1          |  4.0

partition 2

ss_item_sk | ss_quantity
2          |  3.0
3          |  5.0

partition 3

ss_item_sk | ss_quantity
4          |  8.0
5          |  13.0
5          |  10.0

Unfortunately, the code I have below does not work.

// Use 3 shuffle partitions (one per worker node)
spark.sqlContext.setConf("spark.sql.shuffle.partitions", "3")
val json = spark.read.json("hdfs://master:9000/tpcds/store_sales")
val filtered = json.filter(row => row.getAs[Long]("ss_item_sk") < 180)
// Repartition on the key column, then write the result out
filtered.repartition($"ss_item_sk").write.json(savepath)

I have already looked at

  • How to define partitioning of DataFrame?
  • Spark SQL - Difference between df.repartition and DataFrameWriter partitionBy?
  • pyspark: Efficiently have partitionBy write to same number of total partitions as original table

and I still can't figure it out.


Answer 1:


Repartitioning by key distributes the data by that key at the DataFrame level; writing the DataFrame to HDFS is a separate concern. You can try

df.coalesce(1).write.partitionBy("ss_item_sk").json(savepath)

In this case as well you will see multiple part files spread across different directories, one directory per value of the partitioning column. The number of writers/reducers that run can only be controlled through the partitionBy method; it is very similar to a MapReduce Partitioner in that it controls how many reducers run. To get a single file per value of the partition column, you have to run this command:

df.repartition($"ss_item_sk").write.partitionBy("ss_item_sk").json(savepath)

Now this works because each writer/reducer task is mapped to one partition of the repartitioned data, so each output directory receives a single file. Hope this helps.
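If you want to confirm that repartition on the key column really gives you non-overlapping values per partition, here is a minimal sketch (Scala, reusing the path, filter, and column names from the question; the variable names such as partitionsPerKey are just illustrative). It tags each row with the id of the partition it lands in and counts, per ss_item_sk, how many distinct partitions contain it:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("partition-check").getOrCreate()
import spark.implicits._

// Same input and filter as in the question.
val json = spark.read.json("hdfs://master:9000/tpcds/store_sales")
val filtered = json.filter(row => row.getAs[Long]("ss_item_sk") < 180)

// Hash-repartition on the key column into 3 partitions (one per worker).
val repartitioned = filtered.repartition(3, $"ss_item_sk")

// For each ss_item_sk, count how many distinct partitions it appears in.
val partitionsPerKey = repartitioned
  .select($"ss_item_sk")
  .rdd
  .mapPartitionsWithIndex { (idx, rows) => rows.map(r => (r.getLong(0), idx)) }
  .distinct()      // one (key, partitionId) pair per combination
  .groupByKey()    // collect the partition ids seen for each key
  .mapValues(_.size)

// With hash partitioning on the key, no key should span more than one partition.
val overlapping = partitionsPerKey.filter { case (_, n) => n > 1 }.count()
println(s"keys spanning more than one partition: $overlapping") // expected: 0

The check relies on the fact that repartition by a column hash-partitions the rows, so every distinct key value maps to exactly one of the target partitions.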



Source: https://stackoverflow.com/questions/54060188/pre-partition-data-in-spark-such-that-each-partition-has-non-overlapping-values
