CSV file with quoted comma can't be correctly split by Python

守給你的承諾、 submitted on 2019-12-12 05:31:57

Question


import csv

def csv_split():
    raw = [
        '"1,2,3" , "4,5,6" , "456,789"',
        '"text":"a,b,c,d", "gate":"456,789"'
    ]
    cr = csv.reader(raw, skipinitialspace=True)
    for l in cr:
        print(len(l), l)

This function outputs the following:

3 ['1,2,3 ', '4,5,6 ', '456,789']
6 ['text:"a', 'b', 'c', 'd"', 'gate:"456', '789"']

As you can tell, the first line is correctly split into 3 entries, but the second line is not. I would expect the csv reader to split it into two fields; instead we get 6. I have also thought about regex approaches, but they assume a specific quoting dialect.

Basically what I want is: just split the string whenever there is a "," that is not quoted in a pair of "".

Is there any quick and general way to do this? I have seen some regex hacks which assume that every field is ALWAYS quoted etc. I think I can write a small loop that does this very inefficiently, but would definitely appreciate some more expert advice. Thanks a lot!


Answer 1:


CSV isn't a strictly standardized format, but it's common to escape quotation marks that appear inside a field by doubling them (e.g. "text"":""a,b,c,d"). Python's csv reader is doing the right thing here, because it assumes this convention. I'm not quite sure what output you expect, but here is my attempt at a very simple CSV reader which might suit your format. Feel free to adapt it accordingly.

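raw = [
    '"1,2,3" , "4,5,6" , "456,789"',
    '"text":"a,b,c,d", "gate":"456,789"',
    '1,2,  3,'
]

for line in raw:
    # start: index where the current field begins; quoted: inside a "..." pair
    start, quoted, row = 0, False, []
    for pos, c in enumerate(line):
        if c == ',' and not quoted:
            # unquoted comma: close the current field
            row.append(line[start:pos].strip())
            start = pos + 1
        elif c == '"':
            quoted = not quoted
    # whatever follows the last comma is the final field (also safe for an empty line)
    row.append(line[start:].strip())
    for k in range(len(row)):
        if len(row[k]) >= 2 and row[k][0] == '"' and row[k][-1] == '"':
            row[k] = row[k][1:-1]  # remove the enclosing quotation marks
    print(row)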

Output:

['1,2,3', '4,5,6', '456,789']
['text":"a,b,c,d', 'gate":"456,789']
['1', '2', '3', '']
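
For comparison, here is a minimal sketch of the doubled-quote convention described at the top of this answer, assuming Python 3. The input line is a hypothetical re-encoding of the question's second row with every inner quote doubled; it is not data from the question itself.

```python
import csv
import io

# The question's second row, rewritten so that inner quotes are doubled
# (an assumed re-encoding, used only for illustration).
escaped = ['"text"":""a,b,c,d", "gate"":""456,789"']

# With doubled quotes, the standard reader keeps each entry in one field.
rows = list(csv.reader(escaped, skipinitialspace=True))
print(rows)

# csv.writer applies the same doubling when it writes such fields back out.
buf = io.StringIO()
csv.writer(buf).writerow(rows[0])
written = buf.getvalue()
print(repr(written))
```

The reader returns two fields for this row, the same two strings the hand-written loop produces for the unescaped version, so if you control how the file is written, emitting doubled quotes may be simpler than parsing by hand.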



Answer 2:


Leaving this here for posterity, because I struggled with this for a bit too.

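The quotechar argument to csv.reader() helps here (the double quote is already its default value); it lets the reader ignore delimiters (i.e. commas, in this scenario) that sit inside quotes, assuming that every comma inside an entry has been quoted. Note that the opening quote has to be the first character of the field, so if there is a space after the comma you also need skipinitialspace=True. With that, it'll work for this: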

Name, Message
Ford Prefect, Imagine this fork as the temporal universe.
Arthur Dent, "Hey, I was using that!" 

...where the comma has been nested inside quotes, but the non-comma'd string has not.

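Demo code adapted from the Python docs (updated here to Python 3 syntax), and edited so that delimiter is a comma (duh) and quotechar is your double-quote ". skipinitialspace=True is added because the sample above has a space after each comma:

import csv
with open('eggs.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"', skipinitialspace=True)
    for row in spamreader:
        print(', '.join(row))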
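
As a rough check of how quotechar interacts with the space after each comma, the following sketch (Python 3) parses the sample rows from an in-memory string instead of eggs.csv. The helper name read_rows and the variable names are made up for this illustration.

```python
import csv
import io

# The sample rows from above, held in memory rather than in a file.
sample = (
    'Name, Message\n'
    'Ford Prefect, Imagine this fork as the temporal universe.\n'
    'Arthur Dent, "Hey, I was using that!"\n'
)

def read_rows(text, **kwargs):
    """Parse text with a comma delimiter and a double-quote quotechar."""
    stream = io.StringIO(text, newline='')
    return list(csv.reader(stream, delimiter=',', quotechar='"', **kwargs))

# Without skipinitialspace, the quote follows a space, so it is treated as
# ordinary text and the quoted comma still splits the field.
plain = read_rows(sample)

# With skipinitialspace=True, the quote is the first character the reader
# sees in the field, so the comma inside it is preserved.
skipping = read_rows(sample, skipinitialspace=True)

for row in skipping:
    print(row)
```

Here the last row of plain has three fields while the last row of skipping has two, which suggests that the space after the delimiter, rather than quotechar itself, is what decides whether the quoted comma survives.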


Source: https://stackoverflow.com/questions/11388272/csv-file-with-quoted-comma-cant-be-correctly-split-by-python
