Comparing columns in PySpark

Happy的楠姐  2020-12-01 18:17

I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n), and my task is to choose the column with the maximum values in it.

For example:
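A hypothetical sketch of one reading of the task (no sample data is given above, so the toy DataFrame and values here are illustrative only): aggregate each candidate column to its maximum, then report the column whose maximum is largest.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import max as sql_max

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data, for illustration only.
    df = spark.createDataFrame(
        [(10, 10, 1), (200, 2, 20), (3, 30, 300)], ["c1", "c2", "c3"])

    # One pass over the data: the maximum of every candidate column.
    maxes = df.agg(*[sql_max(c).alias(c) for c in df.columns]).first().asDict()

    # The column whose maximum is largest, and that maximum.
    best = max(maxes, key=maxes.get)
    print(best, maxes[best])   # c3 300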

5 Answers
  •  刺人心 (OP)  2020-12-01 18:46

    Here is another simple way of doing it. Let us say that the df below is your DataFrame:

    df = sc.parallelize([(10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
    df.show()
    
    +---+---+---+
    | c1| c2| c3|
    +---+---+---+
    | 10| 10|  1|
    |200|  2| 20|
    |  3| 30|300|
    |400| 40|  4|
    +---+---+---+
    

    You can process the above df as below to get the desired result:

    from pyspark.sql.functions import lit, min as sql_min  # alias so the builtin min is not shadowed

    # For each column, pair a literal column name with that column's minimum,
    # then flatten the single result row into (name, value) tuples.
    df.select( lit('c1').alias('cn1'), sql_min(df.c1).alias('c1'),
               lit('c2').alias('cn2'), sql_min(df.c2).alias('c2'),
               lit('c3').alias('cn3'), sql_min(df.c3).alias('c3')
              )\
             .rdd.flatMap(lambda r: [(r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])\
             .toDF(['Column', 'Min']).show()
    
    +------+---+
    |Column|Min|
    +------+---+
    |    c1|  3|
    |    c2|  2|
    |    c3|  1|
    +------+---+
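
    The round trip through the RDD API can be avoided with Spark SQL's stack() expression, which unpivots the aggregated row entirely inside the DataFrame API. A minimal sketch, assuming the same df as above:

    from pyspark.sql.functions import min as sql_min

    # Aggregate every column to its minimum in a single pass.
    mins = df.agg(*[sql_min(c).alias(c) for c in df.columns])

    # stack(n, name1, val1, ...) turns the one-row result into (Column, Min) rows.
    mins.selectExpr(
        "stack(3, 'c1', c1, 'c2', c2, 'c3', c3) as (Column, Min)"
    ).show()

    The stack() argument list can also be built programmatically from df.columns, so the query scales beyond three hard-coded column names.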
    
