How do I prevent pyspark from interpreting commas as a delimiter in a CSV field that has a JSON object as its value


Question


I am trying to read a comma-delimited CSV file using PySpark 2.4.5 and Databricks' spark-csv module. One of the fields in the CSV file has a JSON object as its value. The contents of the CSV are as follows:

test.csv

header_col_1, header_col_2, header_col_3
one, two, three
one, {“key1”:“value1",“key2”:“value2",“key3”:“value3”,“key4”:“value4"}, three

Other solutions I found defined read options such as "escape": '"' and 'delimiter': ",". These do not work here, because the commas inside the field in question are not enclosed in double quotes. Below is the source code I am using to read the CSV file:

test.py

import findspark

# initialise findspark before importing pyspark, so the pyspark package can be found on sys.path
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()

read_options = {
    'header': 'true',
    "escape": '"',
    'delimiter': ",",
    'inferSchema': 'false',
}

spark_df = spark.read.format('com.databricks.spark.csv').options(**read_options).load('test.csv')

# show() prints the table itself and returns None, so there is no need to wrap it in print()
spark_df.show()

The output of the above program is shown below:

+------------+-----------------+---------------+
|header_col_1|     header_col_2|   header_col_3|
+------------+-----------------+---------------+
|         one|              two|          three|
|         one| {“key1”:“value1"|“key2”:“value2"|
+------------+-----------------+---------------+


Answer 1:


In the CSV file, you have to put the JSON string in straight double quotes, and the double quotes inside the JSON string must be escaped with backslashes (\"). Remove your escape option, as it is incorrect: by default, the delimiter is already set to ",", the escape character to '\', and the quote character to '"'. Refer to the Databricks documentation.
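A minimal sketch of what the corrected file and read call could look like; the file name fixed.csv is illustrative, not taken from the original question:

fixed.csv

header_col_1,header_col_2,header_col_3
one,two,three
one,"{\"key1\":\"value1\",\"key2\":\"value2\",\"key3\":\"value3\",\"key4\":\"value4\"}",three

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()

# No explicit escape option: the defaults (quote '"', escape '\') already
# handle the backslash-escaped quotes inside the quoted JSON field
spark_df = spark.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='false') \
    .load('fixed.csv')
spark_df.show(truncate=False)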




Answer 2:


Delimiters between double quotes are ignored by default; for example, the line a,"b,c",d parses as the three fields a, b,c, and d.

The solution to the issue is not very elegant, and I guess it can be improved. What worked for me was a two-step process. The first step was reading the file as plain text (sc.textFile() in the snippet below). The second step manipulated the JSON object: replace any double quotes inside the object with single quotes, wrap the whole object in double quotes, and write the result to a new CSV file, which I then read with spark.read.format('com.databricks.spark.csv').options(**read_options).load('new.csv').

Below is the code snippet for the program

from pyspark.sql import SparkSession


read_options = {
    'header': 'true',
    'escape': '"',
    'delimiter': ",",
    'inferSchema': 'false',
}


spark = SparkSession.builder.appName('test').getOrCreate()
sc = spark.sparkContext

# read the raw lines to the driver
lines = sc.textFile("test.csv").collect()

new_data = [
    line.replace(' ', '')    # strip all spaces, including those after delimiters
        .replace('“', "'")   # curly opening quotes -> single quotes
        .replace('”', "'")   # curly closing quotes -> single quotes
        .replace('"', "'")   # straight double quotes -> single quotes
        .replace('{', '"{')  # open a wrapping double quote before the JSON object
        .replace('}', '}"')  # close the wrapping double quote after the JSON object
    + '\n'
    for line in lines]

# write the cleaned lines to a new file that the CSV reader can parse
with open('new.csv', 'w') as new_file:
    new_file.writelines(new_data)

spark_df = spark.read.format('com.databricks.spark.csv').options(**read_options).load('new.csv')
spark_df.show(3, False)

The above program produces the output below:

+------------+-----------------------------------------------------------------+------------+
|header_col_1|header_col_2                                                     |header_col_3|
+------------+-----------------------------------------------------------------+------------+
|one         |two                                                              |three       |
|one         |{'key1':'value1','key2':'value2','key3':'value3','key4':'value4'}|three       |
+------------+-----------------------------------------------------------------+------------+
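If the single-quoted JSON column then needs to be parsed into columns, one possible follow-up (not part of the original answer) is from_json; Spark's JSON parser accepts single-quoted names and values by default (allowSingleQuotes is true), so the rewritten values should parse as-is. The schema below is an assumption based on the sample row:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# assumed schema matching the sample data; adjust for the real file
schema = StructType([StructField(k, StringType()) for k in ('key1', 'key2', 'key3', 'key4')])

# replace the raw JSON string with a parsed struct column
parsed_df = spark_df.withColumn('header_col_2', from_json(col('header_col_2'), schema))
parsed_df.show(truncate=False)

Non-JSON values such as two simply become null in the parsed column.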



Source: https://stackoverflow.com/questions/63042848/how-do-i-prevent-pyspark-from-interpreting-commas-as-a-delimiter-in-a-csv-field
