I am trying to filter an RDD as shown below:
spark_df = sc.createDataFrame(pandas_df)
spark_df.filter(lambda r: str(
Convert the DataFrame into an RDD first:
spark_df = sc.createDataFrame(pandas_df)
filtered_rdd = spark_df.rdd.filter(lambda r: str(r['target']).startswith('good'))
filtered_rdd.take(5)
I think it may work!
DataFrame.filter, which is an alias for DataFrame.where, expects a SQL expression expressed either as a Column:
from pyspark.sql.functions import col
spark_df.filter(col("target").like("good%"))
or as an equivalent SQL string:
spark_df.filter("target LIKE 'good%'")
I believe you're trying to use RDD.filter here, which is a completely different method:
spark_df.rdd.filter(lambda r: r['target'].startswith('good'))
and does not benefit from SQL optimizations.
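To make the difference concrete: the function passed to RDD.filter is ordinary Python that runs once per row. The sketch below needs no Spark. It applies the same predicate to plain dicts standing in for Row objects (an assumption for illustration) and shows why the earlier answer wrapped the value in str(): a missing target would otherwise raise AttributeError. The names unsafe_pred and safe_pred are hypothetical.

```python
# Plain dicts stand in for pyspark Row objects here (illustrative assumption);
# both support r['target'] lookups.
rows = [{"target": "good_1"}, {"target": "bad_1"}, {"target": None}]


def unsafe_pred(r):
    # Same predicate as the RDD.filter lambda above; fails when target is None.
    return r["target"].startswith("good")


def safe_pred(r):
    # Guard against missing values before calling a str method.
    value = r["target"]
    return value is not None and value.startswith("good")


# Equivalent in spirit to rdd.filter(safe_pred).collect() on a real RDD.
kept = [r for r in rows if safe_pred(r)]
print(kept)
```

With the Column or SQL-string form, a NULL target simply does not match, so no such guard is needed there.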
I have been through this and have settled on using a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
filtered_df = spark_df.filter(
    udf(lambda target: target.startswith('good'), BooleanType())(spark_df.target)
)
It would be more readable to use a normal function definition instead of the lambda.
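A minimal sketch of the named-function variant, assuming the column is called target as in the question. The names starts_with_good and filter_good are hypothetical, and the None guard is an addition, since Spark passes NULL column values to a Python UDF as None.

```python
def starts_with_good(target):
    """Return True when target is a string beginning with 'good'."""
    # NULL column values arrive as None, so guard before calling startswith.
    return target is not None and target.startswith("good")


def filter_good(spark_df):
    """Keep the rows of spark_df whose 'target' column starts with 'good'."""
    # Imported lazily so starts_with_good can be unit-tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    # Wrap the plain function as a UDF that returns a boolean column.
    starts_with_good_udf = udf(starts_with_good, BooleanType())
    return spark_df.filter(starts_with_good_udf(spark_df["target"]))
```

Usage would be filtered_df = filter_good(spark_df). Keeping the predicate as a plain function also means it can be tested on its own before it is wrapped as a UDF.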