Split a string ignoring quoted sections

别跟我提以往

Given a string like this:

a,"string, with",various,"values, and some",quoted

What is a good algorithm to split this based on commas while ignoring the commas inside the quoted sections?

13 answers

终归单人心

2020-12-06 00:36

Here's a simple algorithm:

1. Determine whether the string begins with a '"' character.
2. Split the string into an array delimited by the '"' character, and drop any empty items.
3. Mark the quoted commas with a placeholder, #COMMA#:
   - If the input starts with a '"', mark the items in the array where index % 2 == 0.
   - Otherwise, mark the items in the array where index % 2 == 1.
4. Concatenate the items in the array to form a modified input string.
5. Split that string into an array delimited by the ',' character.
6. Replace every #COMMA# placeholder in the array with the ',' character.
7. The array is your output.

Here's the Python implementation:
(fixed to handle '"a,b",c,"d,e,f,h","i,j,k"')

def parse_input(text):
    # Once empty segments are dropped, quoted segments sit at even indices
    # if the input starts with a quote, and at odd indices otherwise.
    quote_mod = int(not text.startswith('"'))

    # Split on quotes and drop empty segments. A new list is built instead
    # of removing items from the list while iterating over it.
    parts = [part for part in text.split('"') if part != '']
    for i in range(len(parts)):
        if i % 2 == quote_mod:
            # Hide the quoted commas from the comma split below.
            parts[i] = parts[i].replace(",", "#COMMA#")

    fields = [field for field in "".join(parts).split(",") if field != '']
    # Restore the commas that were inside quotes.
    return [field.replace("#COMMA#", ",") for field in fields]

# parse_input('a,"string, with",various,"values, and some",quoted')
#  -> ['a', 'string, with', 'various', 'values, and some', 'quoted']
# parse_input('"a,b",c,"d,e,f,h","i,j,k"')
#  -> ['a,b', 'c', 'd,e,f,h', 'i,j,k']

再見小時候

2020-12-06 00:37

Looks like you've got some good answers here.

For those of you looking to handle your own CSV file parsing, heed the advice from the experts: don't roll your own CSV parser.

Your first thought is, "I need to handle commas inside of quotes."

Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

It's a road to madness. Don't write your own. Find a library with extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.
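
The answer above only names a .NET library. As an illustration of the same advice in Python, here is a minimal sketch using the standard library's csv module. The helper name split_csv_line is hypothetical, and the sketch assumes a single, non-empty line of input in the default comma-separated dialect.

```python
import csv


def split_csv_line(line):
    # csv.reader accepts any iterable of strings, so a one-item list
    # is enough to parse a single line. (Hypothetical helper name.)
    return next(csv.reader([line]))


print(split_csv_line('a,"string, with",various,"values, and some",quoted'))
# ['a', 'string, with', 'various', 'values, and some', 'quoted']

# Doubled quotes inside a quoted field are unescaped by the default dialect.
print(split_csv_line('a,"say ""hi"", ok",b'))
# ['a', 'say "hi", ok', 'b']
```

The second call shows the "quotes inside of quotes" case that a hand-rolled splitter tends to miss.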

北海茫月

2020-12-06 00:37

This is a standard CSV-style parse. A lot of people try to do this with regular expressions. Regexes can get you about 90% of the way there, but you need a real CSV parser to do it properly. I found a fast, excellent C# CSV parser on CodeProject a few months ago that I highly recommend!

北恋

2020-12-06 00:38
I just couldn't resist seeing if I could make it work in a Python one-liner (str_to_test holds the string from the question):
```
import re
arr = [i.replace("|", ",") for i in re.sub(r'"([^"]*),([^"]*)"', r"\g<1>|\g<2>", str_to_test).split(",")]
```
Returns ['a', 'string, with', 'various', 'values, and some', 'quoted']
It works by first replacing the ',' inside quotes with another separator (|), splitting the string on ',' and then turning the | separator back into ','. Note that the pattern only replaces one comma per quoted section, and it assumes the data itself contains no | characters.
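
A variant of the same replace, split, restore idea that copes with any number of commas in a quoted section can be sketched by passing a function to re.sub instead of a replacement string. The function name split_outside_quotes and the NUL placeholder are assumptions made for this sketch; escaped or doubled quotes are still not handled.

```python
import re


def split_outside_quotes(text, placeholder="\x00"):
    # Hypothetical helper: for every "..." section, drop the quotes and
    # swap all of its commas for a placeholder assumed absent from the data.
    masked = re.sub(
        r'"([^"]*)"',
        lambda match: match.group(1).replace(",", placeholder),
        text,
    )
    # Split on the remaining (unquoted) commas, then restore the hidden ones.
    return [field.replace(placeholder, ",") for field in masked.split(",")]


print(split_outside_quotes('"a,b",c,"d,e,f,h","i,j,k"'))
# ['a,b', 'c', 'd,e,f,h', 'i,j,k']
```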

夕颜

2020-12-06 00:41

Since you said language agnostic, I wrote my algorithm in the language that's as close to pseudocode as possible:

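def find_character_indices(s, ch):
    # Positions of every occurrence of ch in s.
    return [i for i, ltr in enumerate(s) if ltr == ch]


def split_text_preserving_quotes(content, include_quotes=False):
    # Note: unquoted text is split on whitespace, not on commas.
    quote_indices = find_character_indices(content, '"')
    if not quote_indices:
        # No quotes at all: a plain whitespace split is enough.
        return content.split()

    output = content[:quote_indices[0]].split()

    for i in range(1, len(quote_indices)):
        if i % 2 == 1:  # end of quoted sequence
            start = quote_indices[i - 1]
            end = quote_indices[i] + 1
            quoted = content[start:end]
            output.append(quoted if include_quotes else quoted[1:-1])
        else:  # unquoted text between two quoted sequences
            start = quote_indices[i - 1] + 1
            end = quote_indices[i]
            output.extend(content[start:end].split())

    # Text after the last quote; this runs once, after the loop.
    output += content[quote_indices[-1] + 1:].split()

    return output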

慢半拍i

2020-12-06 00:43

What if an odd number of quotes appear in the original string?

This looks uncannily like CSV parsing, which has some peculiarities in how it handles quoted fields. A field is only treated as quoted if it is delimited by double quotes at both ends, so:

field1, "field2, field3", field4, "field5, field6" field7

becomes

field1

field2, field3

field4

"field5

field6" field7

Notice that if a field doesn't both start and end with a quotation mark, it's not a quoted field, and the double quotes are simply treated as literal characters.
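
The rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the name split_fields and its delimiter parameter are hypothetical, spaces around fields are trimmed, and escaped or doubled quotes are not handled.

```python
def split_fields(line, delimiter=","):
    # Hypothetical helper: a field counts as quoted only if it starts with
    # a quote and its closing quote is followed by a delimiter or the end.
    fields = []
    i, n = 0, len(line)
    while i <= n:
        # Skip spaces before the field.
        j = i
        while j < n and line[j] == " ":
            j += 1
        if j < n and line[j] == '"':
            close = line.find('"', j + 1)
            if close != -1:
                # Skip spaces after the closing quote.
                k = close + 1
                while k < n and line[k] == " ":
                    k += 1
                if k == n or line[k] == delimiter:
                    # Properly quoted: keep the content, drop the quotes.
                    fields.append(line[j + 1:close])
                    i = k + 1
                    continue
        # Not a properly quoted field: cut at the next delimiter and
        # keep any quote characters as literal text.
        end = line.find(delimiter, i)
        if end == -1:
            end = n
        fields.append(line[i:end].strip())
        i = end + 1
    return fields


print(split_fields('field1, "field2, field3", field4, "field5, field6" field7'))
# ['field1', 'field2, field3', 'field4', '"field5', 'field6" field7']
```

The last two fields keep their quote characters because the closing quote is followed by field7 rather than by a delimiter.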

Incidentally, my code that someone linked to doesn't actually handle this correctly, if I recall correctly.
