Trim string column in PySpark dataframe


Starting from version 1.5, Spark SQL provides two specific functions for trimming white space, ltrim and rtrim (search for "trim" in the DataFrame documentation); you'll need to import pyspark.sql.functions first. Here is an example:

 from pyspark.sql import SQLContext
 from pyspark.sql.functions import *
 sqlContext = SQLContext(sc)

 df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2']) # create a dataframe - notice the extra whitespaces in the date strings
 df.collect()
 # [Row(d1=u' 2015-04-08 ', d2=u' 2015-05-10 ')]
 df = df.withColumn('d1', ltrim(df.d1)) # trim left whitespace from column d1
 df.collect()
 # [Row(d1=u'2015-04-08 ', d2=u' 2015-05-10 ')]
 df = df.withColumn('d1', rtrim(df.d1))  # trim right whitespace from d1
 df.collect()
 # [Row(d1=u'2015-04-08', d2=u' 2015-05-10 ')]
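The same wildcard import also provides trim, which strips both ends in a single call (the answers below use it). A minimal sketch continuing the example above, assuming the same df and imports:

 df = df.withColumn('d2', trim(df.d2))  # strip both ends of d2 in one call
 df.collect()
 # [Row(d1=u'2015-04-08', d2=u'2015-05-10')]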
from pyspark.sql.functions import trim, col

df = df.withColumn("Product", trim(col("Product")))

The PySpark version of the strip function is called trim; per the docs, it will "trim the spaces from both ends for the specified string column". Make sure to import the function first and to pass the column you are trimming as its argument.

The following should work:

from pyspark.sql.functions import trim
df = df.withColumn("Product", trim(df.Product))

I did it with a UDF, like this:

from pyspark.sql.functions import udf

# wrap Python's str.strip as a UDF
# (note: this shadows the name 'trim', including pyspark.sql.functions.trim if it was imported)
def trim(string):
    return string.strip()
trim = udf(trim)

df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2'])

df2 = df.select(trim(df['d1']).alias('d1'), trim(df['d2']).alias('d2'))

The output looks like this:

df.show()
df2.show()
+------------+------------+
|          d1|          d2|
+------------+------------+
| 2015-04-08 | 2015-05-10 |
+------------+------------+

+----------+----------+
|        d1|        d2|
+----------+----------+
|2015-04-08|2015-05-10|
+----------+----------+
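One caveat with this UDF: calling .strip() on a null value raises an AttributeError, because None has no strip method. A null-safe variant might look like the sketch below (strip_udf is just an illustrative name, not part of the original answer):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# return nulls unchanged instead of calling .strip() on them
strip_udf = udf(lambda s: s.strip() if s is not None else None, StringType())

df2 = df.select(strip_udf(df['d1']).alias('d1'), strip_udf(df['d2']).alias('d2'))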