Spark Python - how to use reduce by key to get minimum/maximum values

Question

I have sample data with the minimum and maximum temperatures of some cities in CSV format (city, minimum, maximum).

Mumbai,19,30
Delhi,5,41
Kolkata,20,40
Mumbai,18,35
Delhi,4,42
Delhi,10,44
Kolkata,19,39

I want to find the all-time lowest temperature recorded for each city using a Spark script in Python.

Here is my script:

cityTemp = sc.textFile("weather.txt").map(lambda x: x.split(','))

# convert it to pair RDD for performing reduce by Key

cityTemp = cityTemp.map(lambda x: (x[0], tuple(x[1:])))

cityTempMin = cityTemp.reduceByKey(lambda x, y: min(x[0],y[0]))

cityTempMin.collect()

My expected output is as follows:

Delhi, 4
Mumbai, 18
Kolkata, 19

However, the script produces the following output:

[(u'Kolkata', u'19'), (u'Mumbai', u'18'), (u'Delhi', u'1')]

How do I get the desired output?

Answer 1:

If you have to use the reduceByKey function, try the solution below (written in Scala):

// Keep only (city, minTemp), then key each pair by city
val df = sc.parallelize(Seq(("Mumbai", 19, 30),
  ("Delhi", 5, 41),
  ("Kolkata", 20, 40),
  ("Mumbai", 18, 35),
  ("Delhi", 4, 42),
  ("Delhi", 10, 44),
  ("Kolkata", 19, 39))).map(x => (x._1, x._2)).keyBy(_._1)

// For each city, keep the (city, minTemp) pair with the smaller minTemp
df.reduceByKey((accum, n) => if (accum._2 > n._2) n else accum).map(_._2).collect().foreach(println)

Output:

(Kolkata,19)
(Delhi,4)
(Mumbai,18)
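
The snippet above is Scala, while the question uses Python. The sketch below is a plain-Python illustration (no Spark needed) of why the question's lambda goes wrong and how a reducer that keeps the value type avoids the problem. `reduce_by_key` is a hypothetical local stand-in for `reduceByKey`, not a Spark API, and the combining order used for Delhi is only one order Spark might pick, assumed here because it reproduces the `u'1'` result.

```python
from collections import defaultdict
from functools import reduce

LINES = [
    "Mumbai,19,30",
    "Delhi,5,41",
    "Kolkata,20,40",
    "Mumbai,18,35",
    "Delhi,4,42",
    "Delhi,10,44",
    "Kolkata,19,39",
]


def reduce_by_key(pairs, func):
    """Hypothetical stand-in for reduceByKey: fold each key's values with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def question_reducer(x, y):
    """The reducer from the question: takes tuples of strings, returns one string."""
    return min(x[0], y[0])


# One possible combining order for Delhi's values (assumed for illustration).
delhi_values = [("10", "44"), ("4", "42"), ("5", "41")]
# Step 1: min("10", "4") gives "10", because strings compare character by character.
# Step 2: x is now the string "10", so x[0] is "1", and min("1", "5") gives "1".
broken_delhi = reduce(question_reducer, delhi_values)

# Fix: convert the minimum to int and keep each value a single number,
# so the reducer's output has the same type as its inputs.
fields = [line.split(",") for line in LINES]
pairs = [(f[0], int(f[1])) for f in fields]
city_min = reduce_by_key(pairs, min)

print(broken_delhi)
print(city_min)
```

In PySpark the same idea should carry over as mapping each split line to `(x[0], int(x[1]))` and then calling `reduceByKey(min)` on the resulting pair RDD.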

If you don't need reduceByKey, a groupBy followed by the min function gives the desired result.

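// Outside spark-shell, toDF also needs: import spark.implicits._
// (assuming a SparkSession named spark)
import org.apache.spark.sql.functions.min

val df = sc.parallelize(Seq(("Mumbai", 19, 30),
  ("Delhi", 5, 41),
  ("Kolkata", 20, 40),
  ("Mumbai", 18, 35),
  ("Delhi", 4, 42),
  ("Delhi", 10, 44),
  ("Kolkata", 19, 39))).toDF("city", "minTemp", "maxTemp")

df.groupBy("city").agg(min("minTemp")).show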

Output:

+-------+------------+
|   city|min(minTemp)|
+-------+------------+
| Mumbai|          18|
|Kolkata|          19|
|  Delhi|           4|
+-------+------------+
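
Since the title asks about both minimum and maximum, a reducer can also track both in one pass, as long as it returns a value with the same shape as its inputs. The following is a plain-Python sketch under that assumption. `reduce_by_key` and `low_high` are hypothetical names used for illustration. In PySpark, `low_high` would be the function passed to `reduceByKey` on an RDD of `(city, (min, max))` pairs.

```python
from collections import defaultdict
from functools import reduce

ROWS = [
    ("Mumbai", 19, 30),
    ("Delhi", 5, 41),
    ("Kolkata", 20, 40),
    ("Mumbai", 18, 35),
    ("Delhi", 4, 42),
    ("Delhi", 10, 44),
    ("Kolkata", 19, 39),
]


def reduce_by_key(pairs, func):
    """Hypothetical stand-in for reduceByKey: fold each key's values with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def low_high(x, y):
    """Combine two (low, high) tuples into one (low, high) tuple."""
    return (min(x[0], y[0]), max(x[1], y[1]))


# Each value is a (low, high) tuple of ints, and low_high returns the same shape,
# so the result of one combine step is a valid input to the next.
pairs = [(city, (low, high)) for city, low, high in ROWS]
extremes = reduce_by_key(pairs, low_high)

for city, (low, high) in extremes.items():
    print(city, low, high)
```

Because the output of `low_high` can be fed straight back in as an input, the order in which values are combined does not change the result.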

Source: https://stackoverflow.com/questions/44176782/spark-python-how-to-use-reduce-by-key-to-get-minmum-maximum-values

Tags: python, apache-spark, pyspark, reduce