E-num / get dummies in PySpark

野的像风 2020-12-18 08:58

I would like to create a function in PySpark that takes a DataFrame and a list of parameters (codes/categorical features) and returns the DataFrame with additiona

4 answers
  •  [愿得一人]
    2020-12-18 09:19

    First you need to collect the distinct values of TYPE and CODE. Then either add one column per value with withColumn, or build all the new columns in a single select. Here is sample code using a select statement:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([
        (1, "A", "X1"),
        (2, "B", "X2"),
        (3, "B", "X3"),
        (1, "B", "X3"),
        (2, "C", "X2"),
        (3, "C", "X2"),
        (1, "C", "X1"),
        (1, "B", "X1"),
    ], ["ID", "TYPE", "CODE"])

    # Collect the distinct values to the driver; sort them so the column order is stable.
    types = sorted(df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect())
    codes = sorted(df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect())
    # One 0/1 indicator expression per distinct value.
    types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
    codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]
    df = df.select("ID", "TYPE", "CODE", *types_expr, *codes_expr)
    df.show()
    

    OUTPUT

    +---+----+----+--------+--------+--------+---------+---------+---------+
    | ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
    +---+----+----+--------+--------+--------+---------+---------+---------+
    |  1|   A|  X1|       1|       0|       0|        1|        0|        0|
    |  2|   B|  X2|       0|       1|       0|        0|        1|        0|
    |  3|   B|  X3|       0|       1|       0|        0|        0|        1|
    |  1|   B|  X3|       0|       1|       0|        0|        0|        1|
    |  2|   C|  X2|       0|       0|       1|        0|        1|        0|
    |  3|   C|  X2|       0|       0|       1|        0|        1|        0|
    |  1|   C|  X1|       0|       0|       1|        1|        0|        0|
    |  1|   B|  X1|       0|       1|       0|        1|        0|        0|
    +---+----+----+--------+--------+--------+---------+---------+---------+
    
