Partition data for efficient joining for Spark dataframe/dataset

Asked by 梦如初夏 on 2020-12-28 09:52

I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with the same key are shuffled to the same partition. Is there an equivalent for a Spark DataFrame or Dataset?

2 Answers
  •  时光取名叫无心
    2020-12-28 10:05

    Yes, this is possible with the DataFrame/Dataset API via the repartition method. It lets you specify one or more columns to use for hash partitioning, e.g.

    val df2 = df.repartition($"colA", $"colB")
    

    You can also specify the desired number of partitions in the same call:

    val df2 = df.repartition(10, $"colA", $"colB")
    

    Note: this does not guarantee that the corresponding partitions of the two DataFrames will be located on the same node, only that both DataFrames are partitioned in the same way.
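
    To make this concrete, here is a minimal sketch of repartitioning both sides of a join on the shared key before joining. The column names, data, and partition count are illustrative assumptions, not from the question; it assumes a local SparkSession and `import spark.implicits._` for the `$"col"` syntax.

    ```scala
    import org.apache.spark.sql.SparkSession

    object CoPartitionedJoin {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("co-partitioned-join")
          .master("local[*]")   // assumption: running locally for the example
          .getOrCreate()
        import spark.implicits._

        // Two small example DataFrames sharing the key column "id"
        val users  = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
        val orders = Seq((1, 9.99), (1, 4.50), (2, 7.25)).toDF("id", "amount")

        // Hash-partition both sides on the join key so that rows with the
        // same "id" land in the same partition of each DataFrame.
        val usersP  = users.repartition(10, $"id")
        val ordersP = orders.repartition(10, $"id")

        // Since both inputs share the same partitioning on "id", Spark's
        // optimizer can often reuse that layout instead of shuffling again.
        val joined = usersP.join(ordersP, "id")
        joined.show()

        spark.stop()
      }
    }
    ```

    Whether the extra shuffle is actually skipped depends on the physical plan Spark chooses; you can check with `joined.explain()`.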
