How do I prevent pyspark from interpreting commas as a delimiter in a CSV field that has a JSON object as its value


Question


I am trying to read a comma-delimited CSV file using PySpark 2.4.5 and Databricks' spark-csv module. One of the fields in the CSV file has a JSON object as its value. The contents of the CSV are as follows:

test.csv

header_col_1, header_col_2, header_col_3
one, two, three
one, {“key1”:“value1",“key2”:“value2",“key3”:“value3”,“key4”:“value4"}, three

Other solutions I found defined read options such as "escape": '"' and 'delimiter': ",". These do not work here, because the commas inside the field in question are not enclosed in double quotes. Below is the source code I am using to read the CSV file:

test.py

import findspark

# initialise findspark before importing pyspark, so the pyspark package can be found on sys.path
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()

read_options = {
    'header': 'true',
    "escape": '"',
    'delimiter': ",",
    'inferSchema': 'false',
}

spark_df = spark.read.format('com.databricks.spark.csv').options(**read_options).load('test.csv')

# show() prints the table itself and returns None, so there is no need to wrap it in print()
spark_df.show()

The output of the above program is shown below:

+------------+-----------------+---------------+
|header_col_1|     header_col_2|   header_col_3|
+------------+-----------------+---------------+
|         one|              two|          three|
|         one| {“key1”:“value1"|“key2”:“value2"|
+------------+-----------------+---------------+


Answer 1:


In the CSV file, you have to put the JSON string in straight double quotes, and the double quotes inside the JSON string must be escaped with backslashes (\"). Remove your escape option, as it is incorrect: by default, the delimiter is already set to ",", the escape character to '\', and the quote character to '"'. Refer to the Databricks documentation.
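A minimal sketch of what the corrected file and read call could look like; the file name fixed.csv is illustrative, not taken from the original question:

fixed.csv

header_col_1,header_col_2,header_col_3
one,two,three
one,"{\"key1\":\"value1\",\"key2\":\"value2\",\"key3\":\"value3\",\"key4\":\"value4\"}",three

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()

# No explicit escape option: the defaults (quote '"', escape '\') already
# handle the backslash-escaped quotes inside the quoted JSON field
spark_df = spark.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='false') \
    .load('fixed.csv')
spark_df.show(truncate=False)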




Answer 2:


Delimiters between double quotes are ignored by default; for example, the line a,"b,c",d parses as the three fields a, b,c, and d.

The solution to the issue is not very elegant, and I guess it can be improved. What worked for me was a two-step process. The first step was reading the file as plain text (sc.textFile() in the snippet below). The second step manipulated the JSON object: replace any double quotes inside the object with single quotes, wrap the whole object in double quotes, and write the result to a new CSV file, which I then read with spark.read.format('com.databricks.spark.csv').options(**read_options).load('new.csv').

Below is the code snippet for the program

from pyspark.sql import SparkSession


read_options = {
    'header': 'true',
    'escape': '"',
    'delimiter': ",",
    'inferSchema': 'false',
}


spark = SparkSession.builder.appName('test').getOrCreate()
sc = spark.sparkContext

# read the raw lines to the driver
lines = sc.textFile("test.csv").collect()

new_data = [
    line.replace(' ', '')    # strip all spaces, including those after delimiters
        .replace('“', "'")   # curly opening quotes -> single quotes
        .replace('”', "'")   # curly closing quotes -> single quotes
        .replace('"', "'")   # straight double quotes -> single quotes
        .replace('{', '"{')  # open a wrapping double quote before the JSON object
        .replace('}', '}"')  # close the wrapping double quote after the JSON object
    + '\n'
    for line in lines]

# write the cleaned lines to a new file that the CSV reader can parse
with open('new.csv', 'w') as new_file:
    new_file.writelines(new_data)

spark_df = spark.read.format('com.databricks.spark.csv').options(**read_options).load('new.csv')
spark_df.show(3, False)

The above program produces the output below:

+------------+-----------------------------------------------------------------+------------+
|header_col_1|header_col_2                                                     |header_col_3|
+------------+-----------------------------------------------------------------+------------+
|one         |two                                                              |three       |
|one         |{'key1':'value1','key2':'value2','key3':'value3','key4':'value4'}|three       |
+------------+-----------------------------------------------------------------+------------+
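If the single-quoted JSON column then needs to be parsed into columns, one possible follow-up (not part of the original answer) is from_json; Spark's JSON parser accepts single-quoted names and values by default (allowSingleQuotes is true), so the rewritten values should parse as-is. The schema below is an assumption based on the sample row:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# assumed schema matching the sample data; adjust for the real file
schema = StructType([StructField(k, StringType()) for k in ('key1', 'key2', 'key3', 'key4')])

# replace the raw JSON string with a parsed struct column
parsed_df = spark_df.withColumn('header_col_2', from_json(col('header_col_2'), schema))
parsed_df.show(truncate=False)

Non-JSON values such as two simply become null in the parsed column.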



Source: https://stackoverflow.com/questions/63042848/how-do-i-prevent-pyspark-from-interpreting-commas-as-a-delimiter-in-a-csv-field
