I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with the same key a
This is possible with the DataFrame/Dataset API through the repartition method. It lets you specify one or more columns to partition the data by, for example:
val df2 = df.repartition($"colA", $"colB")
You can also specify the desired number of partitions in the same call:
val df2 = df.repartition(10, $"colA", $"colB")
Note: this does not guarantee that matching partitions of different DataFrames will be located on the same node, only that the data is partitioned in the same way.
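
As a rough sketch of how this could be applied to a join, the example below repartitions two DataFrames on the same key columns with the same partition count before joining them. The helper partitionByKeys, the sample data, and the local SparkSession settings are assumptions made for illustration and are not part of the original question. Whether Spark actually reuses the partitioning instead of shuffling again depends on the Spark version and configuration (small inputs may be broadcast instead), so it is worth checking the output of explain().

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CoPartitionSketch {
  // Hypothetical helper: hash-partition a DataFrame on the given key columns
  // into a fixed number of partitions.
  def partitionByKeys(df: DataFrame, numPartitions: Int, keys: Seq[String]): DataFrame =
    df.repartition(numPartitions, keys.map(col): _*)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would use its own session.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("co-partition-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data sharing the key columns colA and colB.
    val left  = Seq((1, "a", 10), (2, "b", 20), (3, "c", 30)).toDF("colA", "colB", "x")
    val right = Seq((1, "a", 0.5), (2, "b", 1.5), (4, "d", 2.5)).toDF("colA", "colB", "y")

    val keys = Seq("colA", "colB")
    val numPartitions = 10

    // Same keys and same partition count on both sides, so rows with equal
    // keys end up in partitions with the same index.
    val leftPart  = partitionByKeys(left, numPartitions, keys)
    val rightPart = partitionByKeys(right, numPartitions, keys)

    val joined = leftPart.join(rightPart, keys)

    // Inspect the physical plan to see which exchanges (shuffles) Spark plans.
    joined.explain()
    joined.show()

    spark.stop()
  }
}
```

If several DataFrames are joined on the same keys, the same helper can be applied to each of them with an identical partition count, and caching the repartitioned DataFrames may help when they are reused across multiple joins.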