Question
The following works fine with pyspark using python2:
data = [
    ('A', 2.), ('A', 4.), ('A', 9.),
    ('B', 10.), ('B', 20.),
    ('Z', 3.), ('Z', 5.), ('Z', 8.), ('Z', 12.)
]
rdd = sc.parallelize(data)
sumCount = rdd.combineByKey(lambda value: (value, 1),
                            lambda x, value: (x[0] + value, x[1] + 1),
                            lambda x, y: (x[0] + y[0], x[1] + y[1]))
averageByKey = sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count))
averageByKey.collectAsMap()
The line:
averageByKey = sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count))
raises under Python 3:
SyntaxError: invalid syntax
File "<command-2372155099811162>", line 14
averageByKey = sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count))
I can't find which Python 3 change causes this, or what the alternative is.
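For what it's worth, the failure reproduces without Spark at all; a minimal check that feeds the exact lambda text to `compile`:

```python
# Python 2 accepted tuple parameters in lambdas; Python 3 rejects the
# syntax outright, so the compiler fails before Spark is ever involved.
src = "lambda (key, (totalSum, count)): (key, totalSum / count)"
try:
    compile(src, "<repro>", "eval")
    outcome = "compiled"
except SyntaxError:
    outcome = "SyntaxError"
print(outcome)  # SyntaxError on Python 3 (compiles fine on Python 2)
```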
Answer 1:
The following code in pyspark using python3 works:
data = sc.parallelize([(0, 2.), (0, 4.), (1, 0.), (1, 10.), (1, 20.)])
sumCount = data.combineByKey(lambda value: (value, 1),
                             lambda x, value: (x[0] + value, x[1] + 1),
                             lambda x, y: (x[0] + y[0], x[1] + y[1]))
averageByKey = sumCount.map(
    lambda label_value_sum_count: (label_value_sum_count[0],
                                   label_value_sum_count[1][0] / label_value_sum_count[1][1]))
print(averageByKey.collectAsMap())
and correctly prints:
{0: 3.0, 1: 10.0}
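The three functions passed to combineByKey can be traced in plain Python with no cluster; a sketch assuming a single partition, so the mergeCombiners function is defined but never invoked:

```python
data = [(0, 2.0), (0, 4.0), (1, 0.0), (1, 10.0), (1, 20.0)]

create_combiner = lambda value: (value, 1)                 # first value seen for a key
merge_value = lambda x, value: (x[0] + value, x[1] + 1)    # fold in another value
merge_combiners = lambda x, y: (x[0] + y[0], x[1] + y[1])  # join per-partition results

sum_count = {}
for key, value in data:
    if key in sum_count:
        sum_count[key] = merge_value(sum_count[key], value)
    else:
        sum_count[key] = create_combiner(value)

averages = {key: total / count for key, (total, count) in sum_count.items()}
print(averages)  # {0: 3.0, 1: 10.0}
```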
Python 2 and Python 3 differ here: Python 3 removed tuple parameter unpacking in function and lambda signatures (PEP 3113), which is why the original lambda is now a SyntaxError, and a lot of the older PySpark material on SO is written for Python 2.
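Besides indexing into the pair as above, the unpacking can simply move into the body of a named function, which is often more readable than a long indexed lambda; a sketch on plain tuples (names here are illustrative):

```python
sum_count = [(0, (6.0, 2)), (1, (30.0, 3))]  # (key, (totalSum, count)) pairs

def average_by_key(kv):
    key, (total_sum, count) = kv  # tuple unpacking is still fine inside the body
    return key, total_sum / count

averages = list(map(average_by_key, sum_count))
print(averages)  # [(0, 3.0), (1, 10.0)]
```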
Source: https://stackoverflow.com/questions/58276755/combinebykey-works-fine-with-pyspark-python-2-but-not-python-3