I have a data frame (business_df) with the schema:

|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
|    |-- element
While the problem you've described is not reproducible with the provided code, using Python UDFs for simple tasks like this is rather inefficient. If you simply want to remove spaces from the text, use regexp_replace:
from pyspark.sql.functions import regexp_replace, col
df = sc.parallelize([
(1, "foo bar"), (2, "foobar "), (3, " ")
]).toDF(["k", "v"])
df.select(regexp_replace(col("v"), " ", ""))
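As a local illustration (plain Python, not Spark): regexp_replace uses Java regular expressions, but for a pattern this simple Python's re module behaves the same way on the sample values above:

```python
import re

# The sample values from the frame above.
values = ["foo bar", "foobar ", " "]

# Replacing " " with "" removes every space, including interior ones.
no_spaces = [re.sub(" ", "", v) for v in values]
# → ["foobar", "foobar", ""]
```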
If you want to strip leading and trailing whitespace, use trim:
from pyspark.sql.functions import trim
df.select(trim(col("v")))
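Again as a plain-Python sketch of the behavior: trim acts roughly like str.strip restricted to spaces, so interior spaces are kept while leading and trailing ones are dropped:

```python
values = ["foo bar", "foobar ", " "]

# strip(" ") drops leading/trailing spaces only, keeping interior ones.
trimmed = [v.strip(" ") for v in values]
# → ["foo bar", "foobar", ""]
```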
If you want to keep leading / trailing spaces and only blank out values that consist entirely of whitespace, you can adjust the pattern passed to regexp_replace:

df.select(regexp_replace(col("v"), r"^\s+$", ""))
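One more local sketch to make the anchored pattern concrete: `^\s+$` only matches strings made up entirely of whitespace, so the other rows, spaces and all, pass through unchanged:

```python
import re

values = ["foo bar", "foobar ", " "]

# Only the all-whitespace value matches ^\s+$ and is replaced;
# interior and trailing spaces elsewhere are untouched.
normalized = [re.sub(r"^\s+$", "", v) for v in values]
# → ["foo bar", "foobar ", ""]
```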