Partition data for efficient joining for Spark dataframe/dataset

梦如初夏 2020-12-28 09:52

I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with the same key are shuffled to the same partition, and a join between two RDDs that share a partitioner happens without a shuffle. Is there an equivalent way to pre-partition DataFrames/Datasets so that repeated joins on the same key columns don't re-shuffle the data every time?
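
For context, here is a minimal sketch of the RDD-level mechanism the question refers to (the variable names and sample data are purely illustrative):

    import org.apache.spark.HashPartitioner

    // Hash-partition both pair RDDs with the *same* partitioner; a join between
    // co-partitioned RDDs becomes a narrow dependency, so no extra shuffle.
    val partitioner = new HashPartitioner(100)

    val usersRdd = sc.parallelize(Seq((1, "alice"), (2, "bob")))
      .partitionBy(partitioner)
      .cache()

    val addressesRdd = sc.parallelize(Seq((1, "12 Main St"), (2, "34 Oak Ave")))
      .partitionBy(partitioner)
      .cache()

    val joinedRdd = usersRdd.join(addressesRdd) // joins locally within each partition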

2 Answers
  •  甜味超标
    2020-12-28 10:15

    You can repartition a DataFrame on the join key after loading it if you know you'll be joining it multiple times:

    import org.apache.spark.sql.functions.col

    // addresses and salary are assumed to be DataFrames loaded elsewhere,
    // each containing a userId column.
    val users = spark.read.load("/path/to/users").repartition(col("userId"))

    val joined1 = users.join(addresses, "userId")
    joined1.show() // <-- 1st shuffle for the repartition

    val joined2 = users.join(salary, "userId")
    joined2.show() // <-- skips the shuffle for users since it's already been repartitioned
    

    So the data is shuffled once, and the shuffle files are reused by subsequent joins.
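
    One way to sanity-check this (a small sketch, not part of the original answer) is to look at the physical plan and the Spark UI:

    // Exchange nodes in the physical plan mark shuffles. When joined2.show()
    // runs after joined1.show(), the Spark UI marks the repartition stage for
    // users as "skipped" because its shuffle output is reused.
    joined2.explain()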

    However, if you know you'll be repeatedly shuffling data on certain keys, your best bet would be to save the data as bucketed tables. This will write the data out already pre-hash partitioned, so when you read the tables in and join them you avoid the shuffle. You can do so as follows:

    // You need to pick a number of buckets that makes sense for your data.
    // bucketBy lives on the DataFrameWriter, so go through .write.
    users.write.bucketBy(50, "userId").saveAsTable("users")
    addresses.write.bucketBy(50, "userId").saveAsTable("addresses")

    val users = spark.read.table("users")
    val addresses = spark.read.table("addresses")

    val joined = users.join(addresses, "userId")
    joined.show() // <-- no shuffle since the tables are co-partitioned
    

    In order to avoid a shuffle, the tables have to use the same bucketing (i.e. the same number of buckets, with the join performed on the bucket columns).
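
    If you want to verify that this requirement is met, you can inspect the table metadata and the join plan (a small sketch using standard Spark SQL commands, not part of the original answer):

    // "Num Buckets" and "Bucket Columns" must match on both tables.
    spark.sql("DESCRIBE EXTENDED users").show(100, truncate = false)
    spark.sql("DESCRIBE EXTENDED addresses").show(100, truncate = false)

    // The join's physical plan should then contain no Exchange on either side.
    // Bucketed reads also require spark.sql.sources.bucketing.enabled (true by default).
    users.join(addresses, "userId").explain()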
