Split a string ignoring quoted sections

别跟我提以往 2020-12-06 00:15

Given a string like this:

a,"string, with",various,"values, and some",quoted

What is a good algorithm to split this based on commas while ignoring the commas inside the quoted sections?

13 Answers
  • 2020-12-06 00:36

    Here's a simple algorithm:

    1. Determine if the string begins with a '"' character
    2. Split the string into an array delimited by the '"' character.
    3. Mark the quoted commas with a placeholder #COMMA#
      • If the input starts with a '"', mark those items in the array where the index % 2 == 0
      • Otherwise mark those items in the array where the index % 2 == 1
    4. Concatenate the items in the array to form a modified input string.
    5. Split the string into an array delimited by the ',' character.
    6. Replace all instances in the array of #COMMA# placeholders with the ',' character.
    7. The array is your output.

    Here's the Python implementation:
    (fixed to handle '"a,b",c,"d,e,f,h","i,j,k"')

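    def parse_input(text):
        # Quoted sections land on even indices if the text starts with a quote
        # (the leading empty piece is dropped below), otherwise on odd indices.
        quote_mod = int(not text.startswith('"'))

        # Build new lists instead of calling remove() while iterating.
        # Caveat: dropping empty pieces breaks the parity for empty quoted
        # fields ("") and for back-to-back quotes.
        pieces = [piece for piece in text.split('"') if piece != '']
        for i in range(len(pieces)):
            if i % 2 == quote_mod:
                pieces[i] = pieces[i].replace(",", "#COMMA#")

        # Empty fields are dropped here too, as in the original approach.
        fields = [field for field in "".join(pieces).split(",") if field != '']
        return [field.replace("#COMMA#", ",") for field in fields]

    # parse_input('a,"string, with",various,"values, and some",quoted')
    #  -> ['a', 'string, with', 'various', 'values, and some', 'quoted']
    # parse_input('"a,b",c,"d,e,f,h","i,j,k"')
    #  -> ['a,b', 'c', 'd,e,f,h', 'i,j,k']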
    
  • 2020-12-06 00:37

    Looks like you've got some good answers here.

    For those of you looking to handle your own CSV file parsing, heed the advice of the experts: don't roll your own CSV parser.

    Your first thought is, "I need to handle commas inside of quotes."

    Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

    It's a road to madness. Don't write your own. Find a library with extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.
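
    In Python, the standard library's csv module fills that role. A minimal sketch, assuming the input is a single comma-separated line (split_quoted is a hypothetical helper name, not part of any library):

```python
import csv


def split_quoted(line):
    """Split one comma-separated line, honoring double-quoted fields."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line]))


print(split_quoted('a,"string, with",various,"values, and some",quoted'))
# ['a', 'string, with', 'various', 'values, and some', 'quoted']
```

    The parser strips the surrounding quotes and keeps the commas inside them, which is the behavior the question asks for.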

  • 2020-12-06 00:37

    This is a standard CSV-style parse. A lot of people try to do this with regular expressions. You can get to about 90% with regexes, but you really need a real CSV parser to do it properly. I found a fast, excellent C# CSV parser on CodeProject a few months ago that I highly recommend!
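
    To illustrate where the remaining 10% goes, here is a small Python sketch (Python is used only for illustration; the pattern is one typical regex attempt, not anyone's specific code). It compares that regex with the standard library's csv module on a field containing doubled quotes:

```python
import csv
import re

line = 'a,"say ""hi"", ok",b'

# A common regex attempt: either a quoted run or a run of non-comma characters.
regex_fields = re.findall(r'"[^"]*"|[^,]+', line)

# The csv module treats "" inside a quoted field as one literal quote.
parsed_fields = next(csv.reader([line]))

print(regex_fields)   # ['a', '"say "', '"hi"', '", ok"', 'b']
print(parsed_fields)  # ['a', 'say "hi", ok', 'b']
```

    The regex cuts the quoted field into three pieces, while the parser returns it as a single field.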

  • 2020-12-06 00:38

    I just couldn't resist seeing if I could make it work in a Python one-liner:

    import re
    arr = [i.replace("|", ",") for i in re.sub(r'"([^"]*),([^"]*)"', r"\g<1>|\g<2>", str_to_test).split(",")]
    

    Returns ['a', 'string, with', 'various', 'values, and some', 'quoted']

    It works by first replacing the ',' inside quotes with another separator (|), then splitting the string on ',', and finally turning each | back into ','.
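
    One limitation: the pattern replaces only one comma per quoted section, so a field like "d,e,f,h" still gets split. A minimal sketch of a variant that masks every comma, assuming the placeholder never occurs in the data (split_with_placeholder and its placeholder parameter are hypothetical names):

```python
import re


def split_with_placeholder(text, placeholder="|"):
    """Split on commas, ignoring commas inside double-quoted sections."""
    # For each "..." section, drop the quotes and mask all commas inside it.
    masked = re.sub(
        r'"([^"]*)"',
        lambda match: match.group(1).replace(",", placeholder),
        text,
    )
    # Split on the remaining (unquoted) commas, then restore the masked ones.
    return [part.replace(placeholder, ",") for part in masked.split(",")]


print(split_with_placeholder('"a,b",c,"d,e,f,h","i,j,k"'))
# ['a,b', 'c', 'd,e,f,h', 'i,j,k']
```

    Passing a function as the replacement lets each quoted section be processed as a whole, instead of relying on a fixed number of capture groups.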

  • 2020-12-06 00:41

    Since you said language-agnostic, I wrote my algorithm in the language that's as close to pseudocode as possible:

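    def find_character_indices(s, ch):
        """Return the index of every occurrence of ch in s."""
        return [i for i, ltr in enumerate(s) if ltr == ch]


    def split_text_preserving_quotes(content, include_quotes=False):
        """Split content on whitespace, keeping each "..." section as one item."""
        quote_indices = find_character_indices(content, '"')
        if not quote_indices:
            return content.split()

        # Everything before the first quote is plain text.
        output = content[:quote_indices[0]].split()

        for i in range(1, len(quote_indices)):
            if i % 2 == 1:  # end of quoted sequence
                start = quote_indices[i - 1]
                end = quote_indices[i] + 1
                quoted = content[start:end]
                output.append(quoted if include_quotes else quoted[1:-1])
            else:  # plain text between two quoted sequences
                start = quote_indices[i - 1] + 1
                end = quote_indices[i]
                output.extend(content[start:end].split())

        # Append the trailing text once, after the loop has finished.
        output += content[quote_indices[-1] + 1:].split()

        return output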
    
  • 2020-12-06 00:43

    What if an odd number of quotes appears in the original string?

    This looks uncannily like CSV parsing, which has some peculiarities in how it handles quoted fields. A field is only treated as quoted if the whole field is delimited by double quotes, so:

    field1, "field2, field3", field4, "field5, field6" field7

    becomes

    field1

    field2, field3

    field4

    "field5

    field6" field7

    Notice that if a field doesn't both start and end with a double quote, it's not a quoted field, and the double quotes are simply treated as literal characters.
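
    A rough sketch of that rule in Python, assuming a comma delimiter, fields padded only with spaces, and no escaped quotes inside quoted fields (split_fields is a hypothetical name):

```python
def split_fields(line, delimiter=","):
    """Split a line; a field is quoted only if it both starts and ends with '"'."""
    fields = []
    pos = 0
    n = len(line)
    while pos <= n:
        # Skip spaces in front of the field.
        start = pos
        while start < n and line[start] == " ":
            start += 1
        field = None
        if start < n and line[start] == '"':
            close = line.find('"', start + 1)
            if close != -1:
                # The closing quote must be followed by the delimiter or the end.
                after = close + 1
                while after < n and line[after] == " ":
                    after += 1
                if after == n or line[after] == delimiter:
                    field = line[start + 1:close]
                    pos = after + 1
        if field is None:
            # Not a properly quoted field: quotes are ordinary characters.
            end = line.find(delimiter, start)
            if end == -1:
                end = n
            field = line[start:end].strip()
            pos = end + 1
        fields.append(field)
    return fields


print(split_fields('field1, "field2, field3", field4, "field5, field6" field7'))
# ['field1', 'field2, field3', 'field4', '"field5', 'field6" field7']
```

    An unmatched quote falls through to the unquoted branch, so a string with an odd number of quotes is still split without raising an error.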

    Incidentally, my code that someone linked to doesn't actually handle this correctly, if I recall correctly.
