PySpark - Adding a Column from a list of values using a UDF

臣服心动 2021-01-05 00:32

I have to add a column to a PySpark dataframe based on a list of values.

a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat"

5 Answers
  •  小蘑菇 (original poster)
    2021-01-05 01:30

    I might be wrong, but I believe the accepted answer will not work. monotonically_increasing_id only guarantees that the ids are unique and increasing, not that they are consecutive. Hence, using it on two different dataframes will likely produce two very different id columns, and the join will return mostly empty results.
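
To make the point about non-consecutive ids concrete, here is a rough pure-Python sketch (not Spark code). It assumes the layout described in the Spark documentation for the current implementation of monotonically_increasing_id: partition ID in the upper bits, record number within the partition in the lower 33 bits. The helper `simulated_monotonic_ids` and the partition sizes are hypothetical and chosen only for illustration.

```python
def simulated_monotonic_ids(partition_sizes):
    """Simulate the ids for a dataset split into partitions of the given sizes.

    Assumption: id = (partition index << 33) + row position inside the partition.
    """
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for row_in_partition in range(size):
            ids.append((partition_index << 33) + row_in_partition)
    return ids


# Two datasets with 3 rows each, but split across partitions differently.
ids_a = simulated_monotonic_ids([2, 1])     # rows 0-1 in partition 0, row 2 in partition 1
ids_b = simulated_monotonic_ids([1, 1, 1])  # one row per partition

print(ids_a)  # [0, 1, 8589934592]
print(ids_b)  # [0, 8589934592, 17179869184]

# Ids present on both sides: only these would survive an inner join.
common_ids = sorted(set(ids_a) & set(ids_b))
print(common_ids)  # [0, 8589934592]
```

In this simulated case only two of the three rows would find a join partner, and the id 8589934592 belongs to the third row on one side but the second row on the other, so even a surviving match can pair the wrong rows.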

    Taking inspiration from this answer to a similar question (https://stackoverflow.com/a/48211877/7225303), we could change the incorrect answer to:

    from pyspark.sql.window import Window as W
    from pyspark.sql import functions as F
    
    a = sqlContext.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],
                                   ["Animal", "Enemy"])
    
    a.show()
    
    +------+-----+
    |Animal|Enemy|
    +------+-----+
    |   Dog|  Cat|
    |   Cat|  Dog|
    | Mouse|  Cat|
    +------+-----+
    
    
    
    # Convert the list to a single-column dataframe
    rating = [5, 4, 1]
    b = sqlContext.createDataFrame([(r,) for r in rating], ['Rating'])
    b.show()
    
    +------+
    |Rating|
    +------+
    |     5|
    |     4|
    |     1|
    +------+
    
    
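    # Tag every row with a unique, increasing (but not necessarily consecutive) id
    a = a.withColumn("idx", F.monotonically_increasing_id())
    b = b.withColumn("idx", F.monotonically_increasing_id())
    
    # Replace those ids with consecutive row numbers (1, 2, 3, ...) so both dataframes share the same keys.
    # Note: a window without partitionBy moves all rows into a single partition.
    windowSpec = W.orderBy("idx")
    a = a.withColumn("idx", F.row_number().over(windowSpec))
    b = b.withColumn("idx", F.row_number().over(windowSpec))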
    
    a.show()
    +------+-----+---+
    |Animal|Enemy|idx|
    +------+-----+---+
    |   Dog|  Cat|  1|
    |   Cat|  Dog|  2|
    | Mouse|  Cat|  3|
    +------+-----+---+
    
    b.show()
    +------+---+
    |Rating|idx|
    +------+---+
    |     5|  1|
    |     4|  2|
    |     1|  3|
    +------+---+
    
    # Join on the shared row number, then drop the helper column
    final_df = a.join(b, "idx").drop("idx")
    final_df.show()
    +------+-----+------+
    |Animal|Enemy|Rating|
    +------+-----+------+
    |   Dog|  Cat|     5|
    |   Cat|  Dog|     4|
    | Mouse|  Cat|     1|
    +------+-----+------+
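
The reason the row_number step repairs the join can be sketched in plain Python (again, not Spark code). The helper `row_numbers` is a hypothetical stand-in for row_number() over a window ordered by idx, and the two id lists are made-up examples of non-consecutive ids that two differently partitioned dataframes might receive.

```python
def row_numbers(ids):
    """Map each id to its 1-based rank in sorted order."""
    return {id_value: rank for rank, id_value in enumerate(sorted(ids), start=1)}


animals = [("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")]
ratings = [5, 4, 1]

# Hypothetical non-consecutive ids; the two sides do not line up.
ids_a = [0, 1, 8589934592]
ids_b = [0, 8589934592, 17179869184]

rank_a = row_numbers(ids_a)  # {0: 1, 1: 2, 8589934592: 3}
rank_b = row_numbers(ids_b)  # {0: 1, 8589934592: 2, 17179869184: 3}

# Look up each rating by its row number, then join on that row number.
rating_by_rank = {rank_b[i]: rating for i, rating in zip(ids_b, ratings)}
joined = [
    (animal, enemy, rating_by_rank[rank_a[i]])
    for i, (animal, enemy) in zip(ids_a, animals)
]
print(joined)  # [('Dog', 'Cat', 5), ('Cat', 'Dog', 4), ('Mouse', 'Cat', 1)]
```

Because ranking preserves order and always yields 1..n, both sides end up with identical keys as long as each dataframe keeps its original row order, which this sketch assumes.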
    
