I am reading a csv file in Pyspark as follows:
df_raw = spark.read.option("header", "true").csv(csv_path)
However, the data file has quote
I noticed that your problematic line has escaping that uses double quotes themselves:
"32 XIY ""W"" JK, RE LK"
which should be interpreted just as
32 XIY "W" JK, RE LK
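As described in RFC 4180, page 2, a double quote appearing inside a quoted field must be escaped by preceding it with another double quote.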
That's what Excel does, for example, by default.
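To see the RFC-style rule in isolation, here is a small sketch using Python's standard csv module rather than Spark. The header row and the column names "id" and "desc" are made up for illustration; only the quoted value comes from the line above.

```python
import csv
import io

# One header row plus the problematic value from the answer.
# The column names "id" and "desc" are hypothetical.
raw = 'id,desc\n1,"32 XIY ""W"" JK, RE LK"\n'

# doublequote=True is the csv module's default; it is spelled out here
# to show that "" inside a quoted field is read as one literal quote.
rows = list(csv.reader(io.StringIO(raw), quotechar='"', doublequote=True))

# The comma inside the quotes stays part of the second field.
print(rows[1])
```

The data row comes back as two fields, with the embedded comma and the single pair of quotes around W preserved in the second one.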
In Spark (as of Spark 2.1), however, escaping is done by default in a non-RFC way, using a backslash (\). To fix this, you have to explicitly tell Spark to use the double quote as the escape character:
.option("quote", "\"")
.option("escape", "\"")
The default backslash escaping may explain why the comma was not treated as part of the field even though it was inside a quoted column.
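Combined with the read call from the question, the two options might look like the sketch below. The function name read_rfc4180_csv is hypothetical, and spark is assumed to be an already created SparkSession; the option names and values are the ones given above.

```python
def read_rfc4180_csv(spark, csv_path):
    """Read a CSV whose quoted fields escape quotes by doubling them.

    `spark` is assumed to be an existing SparkSession. The function
    name is hypothetical; the options are those from the answer.
    """
    return (
        spark.read
        .option("header", "true")   # first line holds the column names
        .option("quote", "\"")      # fields are enclosed in double quotes
        .option("escape", "\"")     # a quote inside a field is escaped by doubling it
        .csv(csv_path)
    )
```

Usage would then be df_raw = read_rfc4180_csv(spark, csv_path), with csv_path as in the question.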
The options for the Spark CSV format are not well documented on the Apache Spark site, but here is some slightly older documentation that I still find useful quite often:
https://github.com/databricks/spark-csv
Update Aug 2018: Spark 3.0 might change this behavior to be RFC-compliant. See SPARK-22236 for details.