PySpark: Using repartitionAndSortWithinPartitions with multiple sort criteria

Asked by 再見小時候, 2020-12-17 05:22

Assuming I am having the following RDD:

rdd = sc.parallelize([('a', (5, 1)), ('d', (8, 2)), ('2', (6, 3)),
                      ('a', (8, 2)), ('d', (9, 6)), ('b', (3, 4))])

1 Answer
  • 2020-12-17 05:49

    It is possible, but you'll have to include all required information in a composite key:

    from pyspark.rdd import portable_hash
    
    n = 2
    
    def partitioner(n):
        """Partition by the first item in the key tuple"""
        def partitioner_(x):
            return portable_hash(x[0]) % n
        return partitioner_
    
    
    (rdd
      .keyBy(lambda kv: (kv[0], kv[1][0]))  # Create temporary composite key
      .repartitionAndSortWithinPartitions(
          numPartitions=n, partitionFunc=partitioner(n), ascending=False)
      .map(lambda x: x[1]))  # Drop key (note: there is no partitioner set anymore)
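
    What this pipeline produces can be sanity-checked without a cluster. The sketch below is a pure-Python simulation of keyBy + repartitionAndSortWithinPartitions + the final map (the helper name `simulate` is made up for illustration, and the built-in `hash` stands in for `portable_hash`):

```python
# Pure-Python simulation of the pipeline above (no Spark needed).
# NOTE: `simulate` is an illustrative helper, not Spark API, and the
# built-in `hash` stands in for pyspark.rdd.portable_hash.
def simulate(records, n):
    """Mimic keyBy + repartitionAndSortWithinPartitions + map."""
    def part(composite):
        return hash(composite[0]) % n   # partition on first element only

    partitions = [[] for _ in range(n)]
    for kv in records:
        composite = (kv[0], kv[1][0])   # keyBy: temporary composite key
        partitions[part(composite)].append((composite, kv))
    for p in partitions:
        # ascending=False with the default (identity) keyfunc
        p.sort(key=lambda ck: ck[0], reverse=True)
    return [[kv for _, kv in p] for p in partitions]  # drop composite key

data = [('a', (5, 1)), ('d', (8, 2)), ('2', (6, 3)),
        ('a', (8, 2)), ('d', (9, 6)), ('b', (3, 4))]
parts = simulate(data, 2)
# Every record survives, records sharing a first key share a partition,
# and each partition is sorted descending by (key, first value).
```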
    

    Explained step-by-step:

    • keyBy(lambda kv: (kv[0], kv[1][0])) creates a substitute key which consists of the original key and the first element of the value. In other words, it transforms:

      (0, (5,1))
      

      into

      ((0, 5), (0, (5, 1)))
      

      In practice it can be slightly more efficient to simply reshape the data to

      ((0, 5), 1)
      
    • partitioner defines a partitioning function based on the hash of the first element of the key, so:

      partitioner(7)((0, 5))
      ## 0
      
      partitioner(7)((0, 6))
      ## 0
      
      partitioner(7)((0, 99))
      ## 0
      
      partitioner(7)((3, 99))
      ## 3
      

      as you can see, it is consistent and ignores the second element of the key.

    • we use the default keyfunc, which is the identity function (lambda x: x), and rely on the lexicographic ordering defined on Python tuples:

      (0, 5) < (1, 5)
      ## True
      
      (0, 5) < (0, 4)
      ## False
      
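    The two properties the steps above rely on, a partitioner that ignores the second element and lexicographic tuple comparison, can both be verified in plain Python (the built-in `hash` stands in for `portable_hash`; for small non-negative integers the two agree):

```python
# Built-in hash as a stand-in for pyspark.rdd.portable_hash
# (for small non-negative ints, hash(x) == x in CPython).
def partitioner(n):
    def partitioner_(x):
        return hash(x[0]) % n
    return partitioner_

p = partitioner(7)
assert p((0, 5)) == 0 and p((0, 6)) == 0 and p((0, 99)) == 0
assert p((3, 99)) == 3          # only the first element matters

# Lexicographic tuple comparison drives the within-partition order
assert (0, 5) < (1, 5)
assert not ((0, 5) < (0, 4))
```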

    As mentioned before, you could reshape the data instead:

    rdd.map(lambda kv: ((kv[0], kv[1][0]), kv[1][1]))
    

    and drop the final map to improve performance.
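
    The reshape itself is an ordinary record-level mapping, so it can be checked without a Spark session; after it, each record is already keyed by the composite key, which is why the trailing map becomes unnecessary:

```python
# The reshape lambda from the answer, applied as a plain Python
# function to sample records (no Spark needed).
reshape = lambda kv: ((kv[0], kv[1][0]), kv[1][1])

assert reshape(('a', (5, 1))) == (('a', 5), 1)
assert reshape(('d', (9, 6))) == (('d', 9), 6)
```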
