PySpark: TypeError: condition should be string or Column

天命终不由人 asked 2020-12-17 10:58

I am trying to filter a Spark DataFrame as shown below:

spark_df = sc.createDataFrame(pandas_df)
spark_df.filter(lambda r: str(r['target']).startswith('good'))


        
3 Answers
  • 2020-12-17 11:28

    Convert the DataFrame into an RDD:

    spark_df = sc.createDataFrame(pandas_df)
    filtered_rdd = spark_df.rdd.filter(lambda r: str(r['target']).startswith('good'))
    filtered_rdd.take(5)
    

    I think it may work!

  • 2020-12-17 11:31

    DataFrame.filter, which is an alias for DataFrame.where, expects a SQL expression expressed either as a Column:

    from pyspark.sql.functions import col

    spark_df.filter(col("target").like("good%"))
    

    or equivalent SQL string:

    spark_df.filter("target LIKE 'good%'")
    

    I believe you're trying to use RDD.filter here, which is a completely different method:

    spark_df.rdd.filter(lambda r: r['target'].startswith('good'))
    

    and does not benefit from SQL optimizations.

  • 2020-12-17 11:40

    I have been through this and have settled on using a UDF:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType
    
    filtered_df = spark_df.filter(udf(lambda target: target.startswith('good'), 
                                      BooleanType())(spark_df.target))
    

    It would be more readable to use a normal function definition instead of a lambda.
