I have to add a column to a PySpark DataFrame based on a list of values.
a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat"
I might be wrong, but I believe the accepted answer will not work. monotonically_increasing_id only guarantees that the IDs are unique and increasing, not that they are consecutive: the value encodes the partition ID in its upper bits and the row's position within that partition in its lower bits. Two DataFrames that are partitioned differently will therefore usually get very different ID columns, and a join on those columns will return few or no matching rows.
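To make the mismatch concrete, here is a minimal plain-Python sketch (no Spark needed) of the ID layout described in the Spark documentation, where the partition ID sits in the upper 31 bits and the per-partition row number in the lower 33 bits. The helper `simulated_ids` and the partition layouts are assumptions chosen for illustration, not something taken from the accepted answer.

```python
# Plain-Python model of the documented ID layout; this is not Spark itself.
ROW_BITS = 33  # lower 33 bits hold the row's position within its partition


def simulated_ids(partition_sizes):
    """Return the IDs rows would get for the given rows-per-partition layout."""
    return [
        (partition << ROW_BITS) + row
        for partition, size in enumerate(partition_sizes)
        for row in range(size)
    ]


ids_a = simulated_ids([1, 1, 1])  # assumption: three rows spread over three partitions
ids_b = simulated_ids([3])        # assumption: three rows in a single partition

print(ids_a)  # [0, 8589934592, 17179869184]
print(ids_b)  # [0, 1, 2]
print(set(ids_a) & set(ids_b))  # {0}: only one row would survive an inner join
```

With these assumed layouts only the first row of each DataFrame shares an ID, so an inner join on the raw IDs would keep one row out of three.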
Taking inspiration from this answer to a similar question (https://stackoverflow.com/a/48211877/7225303), we could change the incorrect answer to:
from pyspark.sql.window import Window as W
from pyspark.sql import functions as F
a = sqlContext.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],
                               ["Animal", "Enemy"])
a.show()
+------+-----+
|Animal|Enemy|
+------+-----+
|   Dog|  Cat|
|   Cat|  Dog|
| Mouse|  Cat|
+------+-----+
# Convert the list to a single-column DataFrame
rating = [5,4,1]
b = sqlContext.createDataFrame([(l,) for l in rating], ['Rating'])
b.show()
+------+
|Rating|
+------+
|     5|
|     4|
|     1|
+------+
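# Tag every row with a unique, increasing (but not consecutive) ID
a = a.withColumn("idx", F.monotonically_increasing_id())
b = b.withColumn("idx", F.monotonically_increasing_id())
# Replace the sparse IDs with consecutive row numbers 1..n.
# Note: a window with no partitionBy moves all rows into a single partition.
windowSpec = W.orderBy("idx")
a = a.withColumn("idx", F.row_number().over(windowSpec))
b = b.withColumn("idx", F.row_number().over(windowSpec))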
a.show()
+------+-----+---+
|Animal|Enemy|idx|
+------+-----+---+
|   Dog|  Cat|  1|
|   Cat|  Dog|  2|
| Mouse|  Cat|  3|
+------+-----+---+
b.show()
+------+---+
|Rating|idx|
+------+---+
|     5|  1|
|     4|  2|
|     1|  3|
+------+---+
# Join on the consecutive row number, then drop the helper column
final_df = a.join(b, a.idx == b.idx).drop("idx")
final_df.show()
+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
|   Dog|  Cat|     5|
|   Cat|  Dog|     4|
| Mouse|  Cat|     1|
+------+-----+------+
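
Why the extra row_number step makes the join line up can be sketched in plain Python, without Spark. The helper `with_row_numbers` and the sparse IDs below are hypothetical stand-ins for `row_number()` over `W.orderBy("idx")` and for the values `monotonically_increasing_id` might produce. This is only an illustration of the idea, not what Spark executes.

```python
# Plain-Python stand-in for row_number() over a window ordered by "idx".
def with_row_numbers(rows, sparse_ids):
    """Map each row to its 1-based rank when the rows are ordered by sparse ID."""
    ordered = sorted(zip(sparse_ids, rows), key=lambda pair: pair[0])
    return {rank: row for rank, (_, row) in enumerate(ordered, start=1)}


animals = [("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")]
ratings = [5, 4, 1]

# Hypothetical sparse IDs: a spread over three partitions, b in one partition.
a_ranked = with_row_numbers(animals, [0, 1 << 33, 2 << 33])
b_ranked = with_row_numbers(ratings, [0, 1, 2])

# Inner join on the rank, then drop it, mirroring the final join above.
shared = sorted(a_ranked.keys() & b_ranked.keys())
joined = [a_ranked[k] + (b_ranked[k],) for k in shared]
print(joined)  # [('Dog', 'Cat', 5), ('Cat', 'Dog', 4), ('Mouse', 'Cat', 1)]
```

However far apart the raw IDs are, ranking them yields 1, 2, 3 on both sides, so every row finds its partner. Two caveats apply to the Spark version: it relies on the generated IDs following the original row order, and the unpartitioned window pulls all rows into one partition, which may be slow for large DataFrames.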