How to replace all null values of a DataFrame in PySpark

Submitted by 不羁的心 on 2019-12-18 04:28:11

Question


I have a DataFrame in PySpark with more than 300 columns. Some of these columns contain null values.

For example:

column_1  column_2
null      null
null      null
234       null
125       124
365       187
(and so on)

When I try to sum column_1, I get null as the result instead of 724.

Now I want to replace the nulls in all columns of the DataFrame with an empty value, so that when I sum these columns I get a numerical value instead of null.

How can we achieve that in PySpark?
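
For context, a minimal sketch of the failure mode, assuming the sum is taken row-wise across columns (SQL arithmetic propagates nulls, so any null operand makes the whole result null):

>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame(
...     [(None, None), (234, None), (125, 124)],
...     ['column_1', 'column_2'])
>>> # Row-wise addition returns null whenever either operand is null
>>> df.select((F.col('column_1') + F.col('column_2')).alias('total')).show()
+-----+
|total|
+-----+
| null|
| null|
|  249|
+-----+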


Answer 1:


You can use df.na.fill to replace nulls with zeros, for example:

>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
|   1|
|   2|
|   3|
|null|
+----+

>>> df.na.fill(0).show()
+---+
|col|
+---+
|  1|
|  2|
|  3|
|  0|
+---+
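
Two follow-ups, assuming the same df as above: called with no column list, df.na.fill(0) fills every column whose type matches the value (all numeric columns here, which covers the 300-plus-column case in one call), and both df.na.fill and fillna accept an optional subset of column names:

>>> # Restrict the fill to specific columns; others keep their nulls
>>> df.na.fill(0, subset=['col']).show()
+---+
|col|
+---+
|  1|
|  2|
|  3|
|  0|
+---+

>>> # With the nulls filled, summing the column yields a plain number
>>> from pyspark.sql import functions as F
>>> df.na.fill(0).agg(F.sum('col')).show()
+--------+
|sum(col)|
+--------+
|       6|
+--------+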



Answer 2:


You can use the fillna() function:

>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
|   1|
|   2|
|   3|
|null|
+----+

>>> df = df.fillna({'col': '4'})
>>> df.show()

(or, equivalently, df.fillna({'col': '4'}).show())

+---+
|col|
+---+
|  1|
|  2|
|  3|
|  4|
+---+
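
A note on the dict form: fillna maps column names to per-column replacements, and each replacement is cast to the column's data type, which is why the string '4' above still fills a numeric column. A short sketch with hypothetical column names num and txt, filling a numeric and a string column differently in one call:

>>> # Hypothetical two-column DataFrame: one numeric, one string
>>> df2 = spark.createDataFrame([(1, 'a'), (None, None)], ['num', 'txt'])
>>> # Each column gets its own replacement value
>>> df2.fillna({'num': 0, 'txt': 'missing'}).show()
+---+-------+
|num|    txt|
+---+-------+
|  1|      a|
|  0|missing|
+---+-------+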


Source: https://stackoverflow.com/questions/42312042/how-to-replace-all-null-values-of-a-dataframe-in-pyspark
