Spark dataframe add new column with random data


Question


I want to add a new column to the dataframe with values consisting of either 0 or 1. I used the 'randint' function:

from random import randint

df1 = df.withColumn('isVal',randint(0,1))

But I get the following error:

/spark/python/pyspark/sql/dataframe.py", line 1313, in withColumn
    assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column

How can I use a custom function, or the randint function, to generate random values for this column?


Answer 1:


You are using Python's built-in random module. randint(0, 1) is evaluated once on the driver and returns a plain integer constant, not a Column.

As the error message shows, withColumn expects a Column, i.e. an expression that is evaluated for every row.
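For illustration, wrapping the value in lit() would satisfy the type check but still would not randomize per row, since randint is evaluated once on the driver; a minimal sketch assuming the same df:

from random import randint
from pyspark.sql.functions import lit

# Runs without error, but randint(0, 1) is computed once on the driver,
# so every row of 'isVal' receives the same constant value.
df_const = df.withColumn('isVal', lit(randint(0, 1)))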

To generate a per-row random value instead:

from pyspark.sql.functions import rand,when
df1 = df.withColumn('isVal', when(rand() > 0.5, 1).otherwise(0))

Here rand() draws uniformly from [0.0, 1.0), so isVal is 0 or 1 with equal probability. See the functions documentation for more options (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions).
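If reproducible output matters, rand() also accepts an optional seed; a small sketch along the same lines:

from pyspark.sql.functions import rand, when

# Fixing the seed makes the column reproducible across runs of the same plan.
df1 = df.withColumn('isVal', when(rand(seed=42) > 0.5, 1).otherwise(0))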




Answer 2:


I had a similar problem, needing integer values from 5 to 10, and used the rand() function from pyspark.sql.functions:

from pyspark.sql.functions import rand, round
# rand() is uniform on [0.0, 1.0); scale to [5.0, 10.0) and round to 0 decimals
df1 = df.withColumn("random", round(rand()*(10-5)+5, 0))
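Note that round() makes the endpoints 5 and 10 half as likely as the values in between, because each endpoint only collects half a unit of the rounded range. If the integers should be uniformly distributed, floor() over a range widened by one avoids that; a sketch with the same bounds:

from pyspark.sql.functions import rand, floor

# rand() is uniform on [0.0, 1.0); scaling to [5.0, 11.0) and flooring
# yields each integer 5..10 with equal probability.
df1 = df.withColumn("random", floor(rand() * (10 - 5 + 1) + 5))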


Source: https://stackoverflow.com/questions/41459138/spark-dataframe-add-new-column-with-random-data
