Comparing columns in PySpark

Asked by Happy的楠姐, 2020-12-01 18:17

I am working on a PySpark DataFrame with n columns. Given a set of m of those columns (m < n), my task is to pick, for each row, the maximum value among them.

For example, given columns col_1, col_2, col_3 and the row (1, 2, 3), the expected result for that row is 3; a full example DataFrame appears in the answer below.

5 Answers
  •  春和景丽
    2020-12-01 18:46

    We can use greatest, which returns the row-wise maximum of the given columns.

    Creating the DataFrame

    # Build a sample DataFrame with three integer columns
    df = spark.createDataFrame(
        [[1, 2, 3], [2, 1, 2], [3, 4, 5]],
        ['col_1', 'col_2', 'col_3'],
    )
    df.show()
    +-----+-----+-----+
    |col_1|col_2|col_3|
    +-----+-----+-----+
    |    1|    2|    3|
    |    2|    1|    2|
    |    3|    4|    5|
    +-----+-----+-----+
    

    Solution

    from pyspark.sql.functions import greatest

    # Row-wise maximum across the three columns
    df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))

    # Equivalent form, passing Column objects instead of names:
    # from pyspark.sql.functions import col
    # df2 = df.withColumn('max_by_rows', greatest(col('col_1'), col('col_2'), col('col_3')))
    df2.show()
    
    +-----+-----+-----+-----------+
    |col_1|col_2|col_3|max_by_rows|
    +-----+-----+-----+-----------+
    |    1|    2|    3|          3|
    |    2|    1|    2|          2|
    |    3|    4|    5|          5|
    +-----+-----+-----+-----------+
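
    Since the question asks about an arbitrary set of m columns, note that greatest accepts any number of column arguments (at least two), so a list of names can be unpacked into it. A minimal sketch, assuming the chosen column names are held in a hypothetical Python list cols:

    from pyspark.sql.functions import greatest

    # Hypothetical subset of m column names selected at runtime
    cols = ['col_1', 'col_2', 'col_3']
    df3 = df.withColumn('max_by_rows', greatest(*cols))
    df3.show()

    Also worth knowing: greatest skips nulls, returning the largest non-null value in each row, and yields null only when every input column is null for that row.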
    
