PySpark - Adding a Column from a list of values using a UDF

臣服心动 2021-01-05 00:32

I have to add a column to a PySpark DataFrame based on a list of values.

a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat"


        
5 answers
  •  南笙 (OP)
    2021-01-05 01:17

    Hope this helps!

    from pyspark.sql.functions import monotonically_increasing_id, row_number
    from pyspark.sql import Window
    
    # sample data
    a = sqlContext.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],
                                   ["Animal", "Enemy"])
    a.show()
    
    # convert the list to a single-column DataFrame
    rating = [5,4,1]
    b = sqlContext.createDataFrame([(l,) for l in rating], ['Rating'])
    
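    # add a sequential index to both DataFrames, then join on it to get the final result
    # (a Window with no partitionBy moves all rows to a single partition, so this suits small data)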
    a = a.withColumn("row_idx", row_number().over(Window.orderBy(monotonically_increasing_id())))
    b = b.withColumn("row_idx", row_number().over(Window.orderBy(monotonically_increasing_id())))
    
    final_df = a.join(b, a.row_idx == b.row_idx).\
                 drop("row_idx")
    final_df.show()
    

    Input:

    +------+-----+
    |Animal|Enemy|
    +------+-----+
    |   Dog|  Cat|
    |   Cat|  Dog|
    | Mouse|  Cat|
    +------+-----+
    

    Output is:

    +------+-----+------+
    |Animal|Enemy|Rating|
    +------+-----+------+
    |   Cat|  Dog|     4|
    |   Dog|  Cat|     5|
    | Mouse|  Cat|     1|
    +------+-----+------+
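
In case it helps to see what the row_idx join is doing, here is a rough plain-Python model of it, with no Spark involved. `attach_by_position` is a made-up helper name for illustration, not a PySpark function. It numbers both sides from 1 the way `row_number()` does and keeps only the indexes found on both sides, like an inner join. It also sorts by the index. The Spark join does not guarantee row order by itself, which is likely why the Output above lists Cat before Dog.

```python
# Plain-Python model of the row_idx join; no Spark needed.
# attach_by_position is a hypothetical helper, not part of PySpark.

def attach_by_position(rows, values):
    """Pair each row with the value at the same position (inner join on a 1-based index)."""
    # Mimic row_number(): number both sides starting at 1.
    left = {idx: row for idx, row in enumerate(rows, start=1)}
    right = {idx: val for idx, val in enumerate(values, start=1)}
    # Inner join: keep only the indexes present on both sides.
    shared = sorted(left.keys() & right.keys())
    # Iterating in sorted index order restores the input order.
    return [left[idx] + (right[idx],) for idx in shared]


animals = [("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")]

# Same length on both sides: every row gets its rating.
result = attach_by_position(animals, [5, 4, 1])
print(result)

# A list shorter than the data: the unmatched row is dropped, as in an inner join.
short = attach_by_position(animals, [5, 4])
print(short)
```

So with this approach the list should have exactly as many values as the DataFrame has rows. If the final order matters in Spark, one option is to sort on row_idx before dropping it.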
    
