PySpark create new column with mapping from a dict

感情败类 2020-12-05 02:43

Using Spark 1.6, I have a Spark DataFrame column (named, let's say, col1) with values A, B, C, DS, DNS, E, F, G and H, and I want to create a new column whose values come from mapping col1 through a Python dict.

2 Answers
  • 2020-12-05 03:29

    Sounds like the simplest solution would be to use the replace function (available since Spark 1.4, so it also works on the asker's 1.6): http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace

    mapping = {
        'A': '1',
        'B': '2'
    }
    # replaces matching values of 'yourColName' with their mapped values
    df2 = df.replace(to_replace=mapping, subset=['yourColName'])
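
    Note that replace rewrites the values of the existing column rather than adding a new one, and values missing from the dict are left unchanged instead of becoming null. As a sketch, with the question's full mapping (the column name col1 is assumed from the question):

    mapping = {
        'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S',
        'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}
    # values not in the dict, e.g. 'INVALID', pass through unchanged
    df2 = df.replace(to_replace=mapping, subset=['col1'])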
    
  • 2020-12-05 03:37

    An inefficient but version-independent solution uses a UDF (every row must be shipped to a Python worker for evaluation, which is what makes it slow):

    from pyspark.sql.types import StringType
    from pyspark.sql.functions import udf
    
    def translate(mapping):
        def translate_(col):
            # dict.get returns None for missing keys, which Spark renders as null
            return mapping.get(col)
        return udf(translate_, StringType())
    
    df = sc.parallelize([('DS', ), ('G', ), ('INVALID', )]).toDF(['key'])
    mapping = {
        'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S', 
        'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}
    
    df.withColumn("value", translate(mapping)("key")).show()
    

    with the result:

    +-------+-----+
    |    key|value|
    +-------+-----+
    |     DS|    S|
    |      G|   NS|
    |INVALID| null|
    +-------+-----+
    

    Much more efficient (Spark >= 2.0, Spark < 3.0) is to create a MapType literal:

    from pyspark.sql.functions import col, create_map, lit
    from itertools import chain
    
    # chain(*mapping.items()) flattens the dict into alternating key, value
    # arguments ('A', 'S', 'B', 'S', ...), which is what create_map expects
    mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])
    
    df.withColumn("value", mapping_expr.getItem(col("key"))).show()
    

    with the same result:

    +-------+-----+
    |    key|value|
    +-------+-----+
    |     DS|    S|
    |      G|   NS|
    |INVALID| null|
    +-------+-----+
    

    but a more efficient execution plan:

    == Physical Plan ==
    *Project [key#15, keys: [B,DNS,DS,F,E,H,C,G,A], values: [S,S,S,NS,NS,NS,S,NS,S][key#15] AS value#53]
    +- Scan ExistingRDD[key#15]
    

    compared to the UDF version:

    == Physical Plan ==
    *Project [key#15, pythonUDF0#61 AS value#57]
    +- BatchEvalPython [translate_(key#15)], [key#15, pythonUDF0#61]
       +- Scan ExistingRDD[key#15]
    

    In Spark >= 3.0, getItem should be replaced with __getitem__ ([]), i.e.:

    df.withColumn("value", mapping_expr[col("key")]).show()
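
    For completeness, a self-contained sketch of the Spark >= 3.0 variant (assuming an active SparkSession named spark):

    from itertools import chain
    from pyspark.sql.functions import col, create_map, lit
    
    df = spark.createDataFrame([('DS',), ('G',), ('INVALID',)], ['key'])
    mapping = {
        'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S',
        'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}
    mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])
    # keys absent from the mapping still produce null, as with getItem
    df.withColumn("value", mapping_expr[col("key")]).show()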
    