Filtering a Pyspark DataFrame with SQL-like IN clause

清酒与你 2020-11-27 02:54

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my


        
5 Answers
  •  情深已故
    2020-11-27 03:48

    A slightly different approach that worked for me is to filter with a custom user-defined function (UDF) that checks membership against a broadcast variable.

    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType

    def filter_func(a):
        """Wrapper function that passes the broadcast variable a into a UDF."""
        def filter_func_(value):
            """Filtering function: keep the row if the value is in the broadcast set."""
            if value in a.value:
                return True
            return False
        return udf(filter_func_, BooleanType())

    # Broadcasting allows large variables to be shared with the executors efficiently
    a = sc.broadcast((1, 2, 3))
    df = my_df.filter(filter_func(a)(col('field1')))
    
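    For reference, here is a minimal usage sketch of the pattern above. It assumes a local SparkSession named spark and a small sample DataFrame standing in for my_df; both are placeholders, not from the question:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Hypothetical local session and sample data, for illustration only
    spark = SparkSession.builder.master('local[*]').getOrCreate()
    sc = spark.sparkContext
    my_df = spark.createDataFrame([(1, 'x'), (4, 'y'), (5, 'z')], ['field1', 'field2'])

    a = sc.broadcast((1, 2, 3))
    df = my_df.filter(filter_func(a)(col('field1')))
    df.show()  # expected to keep only the row where field1 is 1

    For small, static value lists, the built-in Column.isin (e.g. col('field1').isin([1, 2, 3])) achieves the same result without a UDF; the broadcast/UDF pattern is mainly useful when the lookup collection is large.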
