Spark SQL removing white spaces

Posted by 折月煮酒 on 2019-11-29 18:17:49

The CSV writer trims leading and trailing spaces by default. You can turn that off with the following options:

   sqlCtx.sql("select * from sourceView").write.
       option("header", true).
       option("ignoreLeadingWhiteSpace",false). // you need this
       option("ignoreTrailingWhiteSpace",false). // and this
       format("csv").save("/my/file/location")

This works for me. If it does not work for you, can you post what you tried and which Spark version you are using? If I remember right, this feature was only introduced last year.
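If you want to check this on your own setup, a small round trip is one way to do it. The sketch below assumes Spark 2.2+ running with a local master; the object name, the sample row and the temporary directory are made up for illustration. It writes one padded value with both options set to false and prints the raw lines of the output files in brackets, so any surviving spaces are visible.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of the original answer.
object KeepWhitespaceCheck {

  // Writes one padded value to CSV with trimming disabled
  // and returns the raw lines of the written files.
  def writtenLines(spark: SparkSession): Seq[String] = {
    import spark.implicits._

    // Sample data (assumption): a value with two leading and two trailing spaces.
    val df = Seq(("k1", "  padded  ")).toDF("key", "value")

    // Fresh, not yet existing directory so the default save mode does not fail.
    val out = Files.createTempDirectory("csv-ws").resolve("out").toString

    df.write
      .option("ignoreLeadingWhiteSpace", false)
      .option("ignoreTrailingWhiteSpace", false)
      .csv(out)

    // Read the files back as plain text so no CSV parsing touches the spaces.
    spark.read.textFile(out).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("keep-whitespace-check")
      .getOrCreate()
    try writtenLines(spark).foreach(line => println(s"[$line]"))
    finally spark.stop()
  }
}
```

If the padding is missing from the printed lines, the options are probably not supported by the Spark version in use.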

For Apache Spark 2.2+ you simply use the "ignoreLeadingWhiteSpace" and "ignoreTrailingWhiteSpace" options (see details in @Roberto Congiu's answer).

I guess this is the default behaviour in earlier Apache Spark versions, though I'm not sure.

For Apache Spark 1.3+ you can set parserLib to "univocity" in order to specify it explicitly:

df.write
  .option("parserLib","univocity")
  .option("ignoreLeadingWhiteSpace","false")
  .option("ignoreTrailingWhiteSpace","false")
  .format("csv")
  .save(outputPath) // outputPath is a placeholder for your output directory

Old "incorrect" answer - shows how to get rid of leading and trailing spaces and tabs in the whole data frame (in all columns)

Here is a Scala solution:

Source DF:

scala> val df = spark.read.json("file:///temp/a.json")
df: org.apache.spark.sql.DataFrame = [key: string, value1: string ... 1 more field]

scala> df.show
+---+-----------------+-----------------+
|key|           value1|           value2|
+---+-----------------+-----------------+
| k1|      Good String|      Good String|
| k1|With Spaces      |With Spaces      |
| k1|        with tab   |        with tab       |
+---+-----------------+-----------------+

Solution:

import org.apache.spark.sql.functions._

val df2 = df.select(df.columns.map(c => regexp_replace(col(c),"(^\\s+|\\s+$)","").alias(c)):_*)

Result:

scala> df2.show
+---+-----------+-----------+
|key|     value1|     value2|
+---+-----------+-----------+
| k1|Good String|Good String|
| k1|With Spaces|With Spaces|
| k1|   with tab|   with tab|
+---+-----------+-----------+
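The pattern can also be tried on plain strings without a Spark session, since Spark's regexp_replace uses Java regular expressions. The sketch below is only an illustration: the object name, the helper and the sample strings are made up and are not taken from a.json. The pattern matches whitespace anchored at the start or the end of a value, so spaces inside the value are kept.

```scala
// Hypothetical demo object, not part of the original answer.
object TrimPatternDemo {

  // Same pattern as in the regexp_replace call above:
  // a run of whitespace at the very start, or a run at the very end.
  val edgeWhitespace = "(^\\s+|\\s+$)"

  // Removes leading and trailing whitespace (spaces, tabs, ...) from one string.
  def trimEdges(s: String): String = s.replaceAll(edgeWhitespace, "")

  def main(args: Array[String]): Unit = {
    // Sample values (assumption): no padding, trailing spaces, tabs on both sides.
    val samples = Seq("Good String", "With Spaces      ", "\twith tab\t")
    // Brackets make any remaining edge whitespace visible.
    samples.foreach(s => println(s"[${trimEdges(s)}]"))
  }
}
```

To strip every whitespace character, including the ones inside a value, the pattern would have to be "\\s+" instead.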

PS: it should be very similar with the Spark Java API.

// hope these two options can solve your question
spark.read.json(inputPath).write
    .option("ignoreLeadingWhiteSpace",false)
    .option("ignoreTrailingWhiteSpace", false)
    .csv(outputPath)

You can check the links below for more info:

https://issues.apache.org/jira/browse/SPARK-18579

https://github.com/apache/spark/pull/17310

Thanks
