I have a data frame (business_df) with the following schema:
|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element
Here's a function that removes all whitespace in a string:
import pyspark.sql.functions as F

def remove_all_whitespace(col):
    # Replace every run of whitespace characters with an empty string
    return F.regexp_replace(col, "\\s+", "")
You can use the function like this:
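import quinn
from pyspark.sql.functions import col

# source_df is assumed to be an existing DataFrame with a string column "words"
actual_df = source_df.withColumn(
    "words_without_whitespace",
    quinn.remove_all_whitespace(col("words"))
)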
The remove_all_whitespace function is defined in the quinn library. quinn also defines single_space and anti_trim functions to manage whitespace. PySpark itself provides ltrim, rtrim, and trim in pyspark.sql.functions.
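
If you want to check what the "\\s+" pattern removes without starting a Spark session, here is a minimal sketch using Python's built-in re module. It only approximates regexp_replace: Spark uses Java regular expressions, where \s matches ASCII whitespace by default, while Python's \s also matches Unicode whitespace. The helper name strip_all_whitespace is made up for this illustration and is not part of quinn or PySpark.

```python
import re

# Hypothetical pure-Python helper, for illustration only
# (not part of quinn or PySpark).
def strip_all_whitespace(s):
    # Pass None through unchanged, similar to how Spark's
    # regexp_replace returns null for a null input.
    if s is None:
        return None
    # Same pattern as in the answer: one or more whitespace
    # characters, replaced with an empty string.
    return re.sub(r"\s+", "", s)

print(strip_all_whitespace("  hello   world \t\n"))
```

Spaces, tabs, and newlines are all removed, including those between words, which is the difference from trim, ltrim, and rtrim: those only touch the ends of the string.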