Assuming I have the following RDD:
rdd = sc.parallelize([('a', (5,1)), ('d', (8,2)), ('2', (6,3)), ('a', (8,2)), ('d', (9,6)), ('b', (3,4)), ...])
It is possible but you'll have to include all required information in the composite key:
from pyspark.rdd import portable_hash

n = 2

def partitioner(n):
    """Partition by the first item in the key tuple"""
    def partitioner_(x):
        return portable_hash(x[0]) % n
    return partitioner_

(rdd
    .keyBy(lambda kv: (kv[0], kv[1][0]))  # Create temporary composite key
    .repartitionAndSortWithinPartitions(
        numPartitions=n, partitionFunc=partitioner(n), ascending=False)
    .map(lambda x: x[1]))  # Drop key (note: there is no partitioner set anymore)
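For a quick sanity check you can materialize each partition with glom(). This is only a sketch; it assumes the sample rdd from the question (with the truncated tail omitted) and reuses the n and partitioner defined above:

# Sketch: sample data from the question, minus the truncated tail
rdd = sc.parallelize([('a', (5, 1)), ('d', (8, 2)), ('2', (6, 3)),
                      ('a', (8, 2)), ('d', (9, 6)), ('b', (3, 4))])

result = (rdd
    .keyBy(lambda kv: (kv[0], kv[1][0]))
    .repartitionAndSortWithinPartitions(
        numPartitions=n, partitionFunc=partitioner(n), ascending=False)
    .map(lambda x: x[1]))

# glom() turns each partition into a list so the per-partition order is visible
for i, part in enumerate(result.glom().collect()):
    print(i, part)

Records that share an original key land in the same partition and, within it, appear next to each other sorted by the first element of the value in descending order.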
Explained step-by-step:
keyBy(lambda kv: (kv[0], kv[1][0]))
creates a substitute key which consists of the original key and the first element of the value. In other words, it transforms:
(0, (5,1))
into
((0, 5), (0, (5, 1)))
In practice it can be slightly more efficient to simply reshape the data to
((0, 5), 1)
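To make the two layouts concrete, here is how a single record looks under each approach (plain Python, purely illustrative):

record = (0, (5, 1))

keyed    = ((record[0], record[1][0]), record)         # ((0, 5), (0, (5, 1)))  via keyBy
reshaped = ((record[0], record[1][0]), record[1][1])   # ((0, 5), 1)            via map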
partitioner
defines a partitioning function based on the hash of the first element of the key, so:
partitioner(7)((0, 5))
## 0
partitioner(7)((0, 6))
## 0
partitioner(7)((0, 99))
## 0
partitioner(7)((3, 99))
## 3
As you can see, it is consistent and ignores the second element of the key.
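Hashing only x[0] is the important part; the default partitionFunc (portable_hash applied to the full composite key) could send keys that share an original key to different partitions. A small sketch contrasting the two, reusing the partitioner defined above:

p = partitioner(7)

# Only the first element is hashed, so these always agree
assert p((0, 5)) == p((0, 99))

# Hashing the whole composite key (the default behaviour) may split the group
def whole_key(x):
    return portable_hash(x) % 7

whole_key((0, 5)), whole_key((0, 99))  # not guaranteed to be equal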
repartitionAndSortWithinPartitions
uses the default keyfunc, which is the identity (lambda x: x), and depends on the lexicographic ordering defined on Python tuples:
(0, 5) < (1, 5)
## True
(0, 5) < (0, 4)
## False
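This is the same ordering repartitionAndSortWithinPartitions applies within each partition; with ascending=False it is simply reversed. Sketched here with plain sorted() for illustration:

composite_keys = [(0, 5), (0, 99), (3, 99), (0, 6)]

# ascending=False corresponds to reverse=True on the lexicographic order
sorted(composite_keys, reverse=True)
## [(3, 99), (0, 99), (0, 6), (0, 5)]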
As mentioned before, you could reshape the data instead:
rdd.map(lambda kv: ((kv[0], kv[1][0]), kv[1][1]))
and drop the final map to improve performance.
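Put together, the reshaped pipeline could look roughly like this (a sketch of the variant described above; note that the records then stay in the ((key, v1), v2) shape because there is no final map to restore the original layout):

(rdd
    # Composite key plus the remaining part of the value
    .map(lambda kv: ((kv[0], kv[1][0]), kv[1][1]))
    .repartitionAndSortWithinPartitions(
        numPartitions=n, partitionFunc=partitioner(n), ascending=False))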