I am reading a csv file in Pyspark as follows:
df_raw=spark.read.option(\"header\",\"true\").csv(csv_path)
However, the data file has quoted fields with embedded commas in them, which should not be treated as delimiters. For example, one problematic line looks like this:
"32 XIY ""W"" JK, RE LK"
How can I handle this in Pyspark?
A delimiter (comma) specified inside quotes will be ignored by default. Spark SQL has a built-in CSV reader as of Spark 2.0:
df = session.read \
    .option("header", "true") \
    .csv("csv/file/path")
More about the CSV reader and its options can be found in the Spark documentation.
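Applied to the question's file, a minimal sketch might look like the following (csv_path is assumed to be the same path variable used in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# csv_path: path to the input file, as in the question (assumed here)
# Read with the built-in CSV reader; the header row supplies column names.
df_raw = spark.read.option("header", "true").csv(csv_path)

# All columns come back as strings unless .option("inferSchema", "true") is added.
df_raw.printSchema()
df_raw.show(5, truncate=False)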
For anyone doing this in Scala: Tagar's answer nearly worked for me (thank you!); all I had to do was escape the double quote when setting my option params:
.option("quote", "\"")
.option("escape", "\"")
I'm using Spark 2.3, so I can confirm Tagar's solution still seems to work the same under the new release.
I noticed that your problematic line has escaping that uses double quotes themselves:
"32 XIY ""W"" JK, RE LK"
which should be interpreted simply as
32 XIY "W" JK, RE LK
As described in RFC-4180, page 2: "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
That's what Excel does, for example, by default.
In Spark, though (as of Spark 2.1), escaping is done by default in a non-RFC way, using a backslash (\). To fix this, you have to explicitly tell Spark to use the double quote as the escape character:
.option("quote", "\"")
.option("escape", "\"")
This may explain why a comma character wasn't interpreted correctly when it appeared inside a quoted column.
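Putting it together in PySpark, a minimal sketch could look like this (csv_path is assumed to be the same path variable used in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# csv_path: path to the input file, as in the question (assumed here)
df = (spark.read
      .option("header", "true")
      .option("quote", '"')    # field quoting character
      .option("escape", '"')   # RFC-4180 style: "" inside a quoted field is a literal "
      .csv(csv_path))

# A field like "32 XIY ""W"" JK, RE LK" should now come back as one column
# value, 32 XIY "W" JK, RE LK, with the embedded comma not treated as a delimiter.
df.show(truncate=False)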
Options for the Spark CSV format are not well documented on the Apache Spark site, but here is some older documentation that I still find useful quite often:
https://github.com/databricks/spark-csv
Update Aug 2018: Spark 3.0 might change this behavior to be RFC-compliant. See SPARK-22236 for details.