Pyspark dataframe operator “IS NOT IN”

轮回少年 2020-12-08 14:15

I would like to rewrite this from R to PySpark. Any nice-looking suggestions?

array <- c(1,2,3)
dataset <- filter(!(column %in% array))
7 Answers
  • 2020-12-08 14:47

    You can use the .subtract() method, buddy.

    Example:

    from pyspark.sql.functions import col

    df1 = df.filter(col("column").isin([1, 2, 3]))  # the rows you want to exclude
    df2 = df.subtract(df1)
    

    This way, df2 contains every row of df that is not in df1.
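
    For a quick check, here is a minimal self-contained sketch of the subtract approach (the SparkSession setup, the column name "column", and the sample values are assumptions added for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    # Hypothetical data; "column" is an assumed column name.
    df = spark.createDataFrame([(1,), (2,), (3,), (4,), (5,)], ["column"])
    df1 = df.filter(col("column").isin([1, 2, 3]))  # rows to exclude
    df2 = df.subtract(df1)
    df2.show()  # only the rows with 4 and 5 remain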

  • 2020-12-08 14:49

    Slightly different syntax, with a "date" data set:

    toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
    df = df.filter(df['DATE'].isin(toGetDates) == False)
    
  • 2020-12-08 14:50

    Use the ~ operator, which means negation:

    df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))
    
  • 2020-12-08 14:50

    df_result = df[df.column_name.isin([1, 2, 3]) == False]
    
  • 2020-12-08 14:53

    In PySpark you can do it like this:

    array = [1, 2, 3]
    dataframe.filter(dataframe.column.isin(array) == False)
    

    Or using the ~ (bitwise NOT) operator:

    dataframe.filter(~dataframe.column.isin(array))
    
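
    Here is a minimal runnable sketch of the same idea (the SparkSession setup and sample data are assumptions added for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    dataframe = spark.createDataFrame([(1,), (2,), (4,), (5,)], ["column"])
    array = [1, 2, 3]
    # Keep the rows whose value is NOT in `array` -> 4 and 5.
    dataframe.filter(~dataframe.column.isin(array)).show()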
  • 2020-12-08 15:05

    The * (argument unpacking) is not needed, because isin() accepts a list directly. So:

    values = [1, 2, 3]  # avoid shadowing the built-in name `list`
    dataframe.filter(~dataframe.column.isin(values))
    
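    Note that isin() accepts either a single list or the values passed individually, so (as a sketch reusing the names above) both of these should be equivalent:

    dataframe.filter(~dataframe.column.isin(values))
    dataframe.filter(~dataframe.column.isin(1, 2, 3))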