Trim string column in PySpark dataframe

僤鯓⒐⒋嵵緔 提交于 2019-12-09 07:57:03

问题


I'm beginner on Python and Spark. After creating a DataFrame from CSV file, I would like to know how I can trim a column. I've try:

df = df.withColumn("Product", df.Product.strip())

df is my data frame, Product is a column in my table

But I see always the error:

Column object is not callable

Do you have any suggestions?


回答1:


Starting from version 1.5, Spark SQL provides two specific functions for trimming white space, ltrim and rtrim (search for "trim" in the DataFrame documentation); you'll need to import pyspark.sql.functions first. Here is an example:

 from pyspark.sql import SQLContext
 from pyspark.sql.functions import *
 sqlContext = SQLContext(sc)

 df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2']) # create a dataframe - notice the extra whitespaces in the date strings
 df.collect()
 # [Row(d1=u' 2015-04-08 ', d2=u' 2015-05-10 ')]
 df = df.withColumn('d1', ltrim(df.d1)) # trim left whitespace from column d1
 df.collect()
 # [Row(d1=u'2015-04-08 ', d2=u' 2015-05-10 ')]
 df = df.withColumn('d1', rtrim(df.d1))  # trim right whitespace from d1
 df.collect()
 # [Row(d1=u'2015-04-08', d2=u' 2015-05-10 ')]



回答2:


from pyspark.sql.functions import trim

df = df.withColumn("Product", trim(col("Product")))



回答3:


The pyspark version of the strip function is called trim. Trim will "trim the spaces from both ends for the specified string column". Make sure to import the function first and to put the column you are trimming inside your function.

The following should work:

from pyspark.sql.functions import trim
df = df.withColumn("Product", trim(df.Product))



回答4:


I did that with the udf like this:

from pyspark.sql.functions import udf

def trim(string):
    return string.strip()
trim=udf(trim)

df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2'])

df2 = df.select(trim(df['d1']).alias('d1'),trim(df['d2']).alias('d2'))

output looks like this:

df.show()
df2.show()
+------------+------------+
|          d1|          d2|
+------------+------------+
| 2015-04-08 | 2015-05-10 |
+------------+------------+

+----------+----------+
|        d1|        d2|
+----------+----------+
|2015-04-08|2015-05-10|
+----------+----------+


来源:https://stackoverflow.com/questions/35155821/trim-string-column-in-pyspark-dataframe

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!