replace column values in spark dataframe based on dictionary similar to np.where

我寻月下人不归  2020-12-22 04:41

My data frame looks like this:

no          city         amount
1           Kenora        56%
2           Sudbury       23%
3           Kenora        71%
4           Sudbury       41%
5           Kenora        33%
6           Niagara       22%
7           Hamilton      88%

I want to replace the values in the city column based on a dictionary, similar to np.where.
1 Answer
  • 2020-12-22 05:26

    The problem is that mapping_expr will return null for any city that is not contained in city_dict. A quick fix is to use coalesce to return the city if the mapping_expr returns a null value:
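    The question omits the definitions of city_dict and mapping_expr that this answer relies on. Here is a minimal setup consistent with the show() output below, assuming an active SparkSession named spark; the exact dictionary values are inferred from that output and should be treated as assumptions:

    from itertools import chain
    from pyspark.sql.functions import create_map, lit
    
    # example data, reconstructed from the show() output below
    df = spark.createDataFrame(
        [(1, "Kenora", "56%"), (2, "Sudbury", "23%"), (3, "Kenora", "71%"),
         (4, "Sudbury", "41%"), (5, "Kenora", "33%"), (6, "Niagara", "22%"),
         (7, "Hamilton", "88%")],
        ["no", "city", "amount"]
    )
    
    # assumed mapping: the output implies Kenora and Niagara both map to "X"
    city_dict = {"Kenora": "X", "Niagara": "X"}
    
    # build a literal map column from the dict; looking up a key that is
    # not in the map yields null
    mapping_expr = create_map([lit(x) for x in chain(*city_dict.items())])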

    from pyspark.sql.functions import coalesce
    
    # look up each city in mapping_expr; if the lookup returns null
    # (city not in city_dict), fall back to the original city value
    df1 = df.withColumn('new_city', coalesce(mapping_expr[df['city']], df['city']))
    df1.show()
    #+---+--------+------+--------+
    #| no|    city|amount|new_city|
    #+---+--------+------+--------+
    #|  1|  Kenora|   56%|       X|
    #|  2| Sudbury|   23%| Sudbury|
    #|  3|  Kenora|   71%|       X|
    #|  4| Sudbury|   41%| Sudbury|
    #|  5|  Kenora|   33%|       X|
    #|  6| Niagara|   22%|       X|
    #|  7|Hamilton|   88%|Hamilton|
    #+---+--------+------+--------+
    
    df1.groupBy('new_city').count().show()
    #+--------+-----+
    #|new_city|count|
    #+--------+-----+
    #|       X|    4|
    #|Hamilton|    1|
    #| Sudbury|    2|
    #+--------+-----+
    

    The above method breaks down, however, if one of the replacement values is null: coalesce then falls back to the original city instead of keeping the intended null.

    In this case, an easier alternative may be to use pyspark.sql.DataFrame.replace():

    First use withColumn to create new_city as a copy of the values from the city column.

    # in Python 3, dict views must be converted to lists before being
    # passed to replace()
    df.withColumn("new_city", df["city"])\
        .replace(to_replace=list(city_dict.keys()), value=list(city_dict.values()), subset="new_city")\
        .groupBy('new_city').count().show()
    #+--------+-----+
    #|new_city|count|
    #+--------+-----+
    #|       X|    4|
    #|Hamilton|    1|
    #| Sudbury|    2|
    #+--------+-----+
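    
    Since replace() also accepts a dictionary for to_replace (mapping each key to its replacement, with value left unset), the call can be written more compactly. A small sketch, using the assumed city_dict from the setup above:

    # pass the dict directly: keys found in new_city are replaced by
    # their mapped values, everything else is left untouched
    df.withColumn("new_city", df["city"])\
        .replace(city_dict, subset="new_city")\
        .groupBy("new_city").count().show()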
    