Pyspark dataframe operator “IS NOT IN”

Submitted by 空扰寡人 on 2019-12-30 01:40:07

Question


I would like to rewrite this from R to PySpark. Any clean-looking suggestions?

array <- c(1,2,3)
dataset <- filter(dataset, !(column %in% array))

Answer 1:


In PySpark you can do it like this:

array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(*array) == False)

Or using the unary ~ operator, which negates a column expression:

dataframe.filter(~dataframe.column.isin(*array))



Answer 2:


Use the ~ operator, which negates the condition:

df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))



Answer 3:


df_result = df[df.column_name.isin([1, 2, 3]) == False]



Answer 4:


Slightly different syntax, using a set of date strings:

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
df = df.filter(df['DATE'].isin(toGetDates) == False)



Answer 5:


You can also loop over the array and filter out one value at a time:

array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)


Source: https://stackoverflow.com/questions/40287237/pyspark-dataframe-operator-is-not-in
