Difference of elements in list in PySpark

白昼怎懂夜的黑 submitted on 2021-02-07 10:59:28

Question


I have a PySpark dataframe (df) with a column that contains two-element lists. The two elements in each list are not sorted in ascending or descending order.

+--------+----------+-------+
| version| timestamp| list  |
+--------+----------+-------+
| v1     |2012-01-10| [5,2] |
| v1     |2012-01-11| [2,5] |
| v1     |2012-01-12| [3,2] |
| v2     |2012-01-12| [2,3] |
| v2     |2012-01-11| [1,2] |
| v2     |2012-01-13| [2,1] |
+--------+----------+-------+

I want to take the difference between the first and second elements of each list and store it in a new column (diff). Here is an example of the output that I want.

+--------+----------+-------+-------+
| version| timestamp| list  |  diff |
+--------+----------+-------+-------+
| v1     |2012-01-10| [5,2] |   3   |
| v1     |2012-01-11| [2,5] |  -3   |
| v1     |2012-01-12| [3,2] |   1   |
| v2     |2012-01-12| [2,3] |  -1   |
| v2     |2012-01-11| [1,2] |  -1   |
| v2     |2012-01-13| [2,1] |   1   |
+--------+----------+-------+-------+

How can I do this using PySpark?

I tried the following:

transform_expr = (
        "transform(diff, x-y ->"
        + "x as list[0], y as list[1])"
    )

df = df.withColumn("diff", F.expr(transform_expr)) 

However, this did not give me any output.

I am open to both UDF-based approaches and approaches that avoid UDFs. Thanks.


Answer 1:


There are several ways to do this: you can use element_at or transform (both require Spark 2.4 or newer), array indexing such as list[0], or .getItem() to compute the difference.

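# imports used by the examples below
from pyspark.sql.functions import col, concat_ws, element_at, expr

# sample dataframe (assumes an active SparkSession named spark)
df = spark.createDataFrame([([5, 2],), ([2, 5],)], ["list"])

# using element_at (array indexes are 1-based)
df.withColumn("diff", element_at(col("list"), 1) - element_at(col("list"), 2)).show()

# using transform: wrap list in an outer array, subtract inside the lambda,
# then concat_ws flattens the one-element result (so diff comes back as a string)
df.withColumn("diff", concat_ws("", expr("""transform(array(list), x -> x[0] - x[1])"""))).show()

# using array index (0-based)
df.withColumn("diff", col("list")[0] - col("list")[1]).show()

# using .getItem (0-based)
df.withColumn("diff", col("list").getItem(0) - col("list").getItem(1)).show()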

#+------+----+
#|  list|diff|
#+------+----+
#|[5, 2]|   3|
#|[2, 5]|  -3|
#+------+----+
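
Since the question also asks about UDF-based approaches, here is a minimal sketch of one. The names list_diff and add_diff_column are hypothetical helpers introduced for illustration, and the LongType return type assumes the lists hold integers, as in the sample data. The built-in column expressions above are usually the better choice, because a Python UDF has to move every row between the JVM and the Python worker.

```python
from typing import Optional, Sequence


def list_diff(values: Optional[Sequence[Optional[int]]]) -> Optional[int]:
    """Return values[0] - values[1], or None if the list is missing or incomplete."""
    if values is None or len(values) < 2:
        return None
    if values[0] is None or values[1] is None:
        return None
    return values[0] - values[1]


def add_diff_column(df, list_col="list", diff_col="diff"):
    """Hypothetical helper: append diff_col, computed from list_col by a Python UDF."""
    # Imported lazily so list_diff can be used and tested without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    # Assumption: the array elements are integers, so the UDF returns a long.
    diff_udf = F.udf(list_diff, LongType())
    return df.withColumn(diff_col, diff_udf(F.col(list_col)))
```

With the sample dataframe from this answer, add_diff_column(df).show() should produce the same diff column as the built-in variants. Returning None for short or null lists is a design choice of this sketch; it yields a null in the diff column instead of failing the job.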


Source: https://stackoverflow.com/questions/61400653/difference-of-elements-in-list-in-pyspark
