Partition data for efficient joining for Spark dataframe/dataset

Asked by 梦如初夏 on 2020-12-28 09:52

I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with the same key are shuffled to the same partition. Is there an equivalent for a Spark DataFrame or Dataset?

2 Answers
  •  时光取名叫无心
    2020-12-28 10:05

    Yes, this is possible with the DataFrame/Dataset API via the repartition method. It lets you specify one or more columns to use for hash partitioning, e.g.

    val df2 = df.repartition($"colA", $"colB")
    

    You can also specify the desired number of partitions in the same call:

    val df2 = df.repartition(10, $"colA", $"colB")
    

    Note: this does not guarantee that the corresponding partitions of the two DataFrames will be located on the same node, only that both DataFrames are partitioned in the same way.
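
    To make this concrete, here is a minimal sketch of repartitioning both sides of a join on the shared key before joining. The column names, data, and partition count are illustrative assumptions, not from the question; it assumes a local SparkSession and `import spark.implicits._` for the `$"col"` syntax.

    ```scala
    import org.apache.spark.sql.SparkSession

    object CoPartitionedJoin {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("co-partitioned-join")
          .master("local[*]")   // assumption: running locally for the example
          .getOrCreate()
        import spark.implicits._

        // Two small example DataFrames sharing the key column "id"
        val users  = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
        val orders = Seq((1, 9.99), (1, 4.50), (2, 7.25)).toDF("id", "amount")

        // Hash-partition both sides on the join key so that rows with the
        // same "id" land in the same partition of each DataFrame.
        val usersP  = users.repartition(10, $"id")
        val ordersP = orders.repartition(10, $"id")

        // Since both inputs share the same partitioning on "id", Spark's
        // optimizer can often reuse that layout instead of shuffling again.
        val joined = usersP.join(ordersP, "id")
        joined.show()

        spark.stop()
      }
    }
    ```

    Whether the extra shuffle is actually skipped depends on the physical plan Spark chooses; you can check with `joined.explain()`.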
