Question
I'm a beginner with Python and Spark. After creating a DataFrame from a CSV file, I would like to know how I can trim a column. I've tried:
df = df.withColumn("Product", df.Product.strip())
df is my DataFrame, and Product is a column in my table.
But I always see the error:
Column object is not callable
Do you have any suggestions?
Answer 1:
Starting from version 1.5, Spark SQL provides ltrim and rtrim for trimming white space from the left and right ends of a string column (search for "trim" in the DataFrame documentation); you'll need to import them from pyspark.sql.functions first. Here is an example:
from pyspark.sql import SQLContext
from pyspark.sql.functions import ltrim, rtrim
sqlContext = SQLContext(sc)
# create a dataframe - notice the extra whitespace in the date strings
df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2'])
df.collect()
# [Row(d1=u' 2015-04-08 ', d2=u' 2015-05-10 ')]
df = df.withColumn('d1', ltrim(df.d1)) # trim left whitespace from column d1
df.collect()
# [Row(d1=u'2015-04-08 ', d2=u' 2015-05-10 ')]
df = df.withColumn('d1', rtrim(df.d1)) # trim right whitespace from d1
df.collect()
# [Row(d1=u'2015-04-08', d2=u' 2015-05-10 ')]
Answer 2:
from pyspark.sql.functions import col, trim
df = df.withColumn("Product", trim(col("Product")))
Answer 3:
The PySpark version of the strip function is called trim. Trim will "trim the spaces from both ends for the specified string column". Make sure to import the function first and to pass the column you are trimming as its argument.
The following should work:
from pyspark.sql.functions import trim
df = df.withColumn("Product", trim(df.Product))
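As for why df.Product.strip() fails with "Column object is not callable": as far as I can tell, a Column treats an unknown attribute name as a nested-field lookup and returns another Column, so the failure only happens when that result is called. The sketch below imitates this in plain Python. FakeColumn is a hypothetical stand-in written for illustration, not PySpark's actual class, and the real implementation may differ in its details.

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark Column; not PySpark's real class."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Called only when normal attribute lookup fails. Unknown names are
        # treated as a nested-field lookup, so the result is another column.
        if item.startswith("__"):
            raise AttributeError(item)
        return FakeColumn(f"{self.name}.{item}")


product = FakeColumn("Product")
maybe_method = product.strip  # no error yet: this is just another FakeColumn

try:
    maybe_method()  # the class defines no __call__, so this raises TypeError
except TypeError as exc:
    error_message = str(exc)
```

This is why string methods such as strip cannot be called on a column directly, and a column function such as trim has to be applied instead.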
Answer 4:
I did that with a UDF like this:
from pyspark.sql.functions import udf
def trim(string):
    return string.strip()
# wrap the Python function so it can be applied to DataFrame columns
trim = udf(trim)
df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2'])
df2 = df.select(trim(df['d1']).alias('d1'),trim(df['d2']).alias('d2'))
Output looks like this:
df.show()
+------------+------------+
|          d1|          d2|
+------------+------------+
| 2015-04-08 | 2015-05-10 |
+------------+------------+
df2.show()
+----------+----------+
|        d1|        d2|
+----------+----------+
|2015-04-08|2015-05-10|
+----------+----------+
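One caveat with the UDF approach: Spark passes SQL NULL values to a Python UDF as None, and None.strip() raises an AttributeError. A minimal sketch of a null-safe variant follows. The names safe_strip and make_trim_udf are my own, chosen for illustration, and the PySpark import is done lazily so the plain-Python helper can be checked without a Spark installation.

```python
def safe_strip(value):
    """Strip whitespace from both ends; pass None (SQL NULL) through unchanged."""
    return None if value is None else value.strip()


def make_trim_udf():
    """Wrap safe_strip as a Spark UDF that returns strings (assumes PySpark is installed)."""
    # Imported here, not at module level, so safe_strip works without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(safe_strip, StringType())
```

With a DataFrame like the one above, this would be used as trim_udf = make_trim_udf() followed by df.select(trim_udf(df['d1']).alias('d1')). The built-in trim from pyspark.sql.functions is usually the better choice when it fits, since a Python UDF has to move each value between the JVM and the Python worker.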
Source: https://stackoverflow.com/questions/35155821/trim-string-column-in-pyspark-dataframe