Pyspark dataframe operator “IS NOT IN”

轮回少年 2020-12-08 14:15

I would like to rewrite this from R to PySpark. Any nice-looking suggestions?

array <- c(1,2,3)
dataset <- filter(!(column %in% array))
7 Answers
  • 2020-12-08 14:47

    You can use the .subtract() method, buddy.

    Example:

    from pyspark.sql.functions import col

    df1 = df.filter(col("column").isin([1, 2, 3]))  # the rows you want to exclude
    df2 = df.subtract(df1)
    

    This way, df2 contains every row of df that is not in df1.
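
    For a quick check, here is a minimal self-contained sketch of the subtract approach (the SparkSession setup, the column name "column", and the sample values are assumptions added for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    # Hypothetical data; "column" is an assumed column name.
    df = spark.createDataFrame([(1,), (2,), (3,), (4,), (5,)], ["column"])
    df1 = df.filter(col("column").isin([1, 2, 3]))  # rows to exclude
    df2 = df.subtract(df1)
    df2.show()  # only the rows with 4 and 5 remain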

  • 2020-12-08 14:49

    Slightly different syntax, with a "date" data set:

    toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
    df = df.filter(df['DATE'].isin(toGetDates) == False)
    
  • 2020-12-08 14:50

    Use the ~ operator, which means negation:

    df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))
    
  • 2020-12-08 14:50

    df_result = df[df.column_name.isin([1, 2, 3]) == False]
    
  • 2020-12-08 14:53

    In PySpark you can do it like this:

    array = [1, 2, 3]
    dataframe.filter(dataframe.column.isin(array) == False)
    

    Or using the ~ (bitwise NOT) operator:

    dataframe.filter(~dataframe.column.isin(array))
    
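
    Here is a minimal runnable sketch of the same idea (the SparkSession setup and sample data are assumptions added for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    dataframe = spark.createDataFrame([(1,), (2,), (4,), (5,)], ["column"])
    array = [1, 2, 3]
    # Keep the rows whose value is NOT in `array` -> 4 and 5.
    dataframe.filter(~dataframe.column.isin(array)).show()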
  • 2020-12-08 15:05

    The * (argument unpacking) is not needed, because isin() accepts a list directly. So:

    values = [1, 2, 3]  # avoid shadowing the built-in name `list`
    dataframe.filter(~dataframe.column.isin(values))
    
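    Note that isin() accepts either a single list or the values passed individually, so (as a sketch reusing the names above) both of these should be equivalent:

    dataframe.filter(~dataframe.column.isin(values))
    dataframe.filter(~dataframe.column.isin(1, 2, 3))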