How could I add a column to a DataFrame in Pyspark with incremental values?

Question


I have a DataFrame called 'df' like the following:

+-------+-------+-------+
|  Atr1 |  Atr2 |  Atr3 |
+-------+-------+-------+
|   A   |   A   |   A   |
+-------+-------+-------+
|   B   |   A   |   A   |
+-------+-------+-------+
|   C   |   A   |   A   |
+-------+-------+-------+

I want to add a new column to it with incremental values and get the following updated DataFrame:

+-------+-------+-------+-------+
|  Atr1 |  Atr2 |  Atr3 |  Atr4 |
+-------+-------+-------+-------+
|   A   |   A   |   A   |   1   |
+-------+-------+-------+-------+
|   B   |   A   |   A   |   2   |
+-------+-------+-------+-------+
|   C   |   A   |   A   |   3   |
+-------+-------+-------+-------+

How could I get it?


Answer 1:


If you only need incremental values (like an ID) and there is no constraint that the numbers be consecutive, you could use monotonically_increasing_id(). The only guarantee this function gives is that the generated values are monotonically increasing and unique across rows; they are not guaranteed to be consecutive, and the actual values can differ between executions.

from pyspark.sql.functions import monotonically_increasing_id

# withColumn returns a new DataFrame, so keep the result
df = df.withColumn("Atr4", monotonically_increasing_id())

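If the new column must contain exactly consecutive values (1, 2, 3, ...) as in the expected output above, one option is to compute a row number over a window. This is a minimal sketch, assuming ordering by Atr1 defines the desired numbering; note that a window without a partition pulls all rows into a single partition, which can be slow on large DataFrames.

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# Assumption: ordering by Atr1 gives the desired row order
w = Window.orderBy("Atr1")

# row_number() starts at 1 and yields consecutive values
df = df.withColumn("Atr4", row_number().over(w))
df.show()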

Source: https://stackoverflow.com/questions/46213986/how-could-i-add-a-column-to-a-dataframe-in-pyspark-with-incremental-values
