Parsing comma-separated values containing quoted commas and newlines

温柔的废话 2020-12-20 09:33

I have a string that contains some special characters. The aim is to retrieve a String[] of the comma-separated fields on each line. A field may be wrapped in double quotes ("), and inside the quotes it can contain \n and , characters.



        
4 Answers
  • 2020-12-20 10:01

    Try this:

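    // Requires java.util.regex.Pattern and java.util.regex.Matcher.
    String source = "Alpha,Beta,Gama,\"23-5-2013,TOM\",TOTO,\"Julie, KameL\n"
                  + "Titi\",God,\" timmy, tomy,tony,\n"
                  + "tini\".";

    // Group 2: an unquoted field. Group 3: the contents of a quoted field.
    Pattern p = Pattern.compile("(([^\"][^,]*)|\"([^\"]*)\"),?");
    Matcher m = p.matcher(source);

    while (m.find())
    {
        // Exactly one of the two groups is set per match; drop embedded newlines.
        if (m.group(2) != null)
            System.out.println(m.group(2).replace("\n", ""));
        else if (m.group(3) != null)
            System.out.println(m.group(3).replace("\n", ""));
    }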
    

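    If the pattern matches a field without quotes, the result is returned in group 2. Quoted fields are returned in group 3, without the quotes. Hence I needed the distinction in the while block. You might find a prettier way. Note that this pattern assumes there are no empty fields and no escaped quotes ("") inside quoted fields.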

    Output:
    Alpha
    Beta
    Gama
    23-5-2013,TOM
    TOTO
    Julie, KameLTiti
    God
     timmy, tomy,tony,tini
    .

  • 2020-12-20 10:09

    Description

    Consider the following PowerShell example of a general-purpose regex, tested on a Java parser, which requires no extra processing to reassemble the data parts. The first capture group matches an opening quote if there is one, and a backreference then requires the same quote at the end of the match, so you are assured of capturing the entire value between the quotes but not the quotes themselves. Commas are not captured unless they are embedded in a quote-delimited substring.

    (?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)

    Example

    $Matches = @()
    $String = 'Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
    Titi",God,"timmy, \n
    tomy,tony,tini"'
    $Regex = '(?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)'
    
    Write-Host start with 
    write-host $String
    Write-Host
    Write-Host found
    ([regex]"(?i)(?m)$Regex").matches($String) | foreach {
        write-host "key at $($_.Groups[1].Index) = '$($_.Groups[1].Value)'`t= value at $($_.Groups[2].Index) = '$($_.Groups[2].Value)'"
        } # next match
    

    Yields

    start with
    Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
    Titi",God,"timmy, \n
    tomy,tony,tini"
    
    found
    key at 0 = ''   = value at 0 = 'Alpha'
    key at 6 = ''   = value at 6 = 'Beta'
    key at 11 = ''  = value at 11 = 'Gama'
    key at 16 = '"' = value at 17 = '23-5-2013,TOM'
    key at 32 = ''  = value at 32 = 'TOTO'
    key at 37 = '"' = value at 38 = 'Julie, KameL\n
    Titi'
    key at 60 = ''  = value at 60 = 'God'
    key at 64 = '"' = value at 65 = 'timmy, \n
    tomy,tony,tini'
    

    Summary


    • (?: start non-capturing group
    • ^ require start of string
    • | or
    • ,\s{0,} a comma followed by any amount of whitespace
    • ) close the non-capturing group
    • ( start capture group 1
    • ["]? consume a quote if it exists; I like doing it this way in case you want to allow characters other than a quote
    • ) close capture group 1
    • \s{0,} consume any whitespace if it exists, which means you don't need to trim the value later
    • ( start capture group 2
    • (?:.|\n|\r)*? capture all characters, including newlines, non-greedy
    • ) close capture group 2
    • \1 if there was an opening quote it is stored in group 1, so require the same quote here
    • (?= start zero-width lookahead assertion
    • [,]\s{0,} must have a comma followed by optional whitespace
    • | or
    • $ end of the string
    • ) close the zero-width lookahead assertion
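
For readers who want to try the same expression from Java, here is a minimal sketch. The class name BackrefCsvSketch and the helper split are made up for illustration, and the sketch assumes a single record whose first field is not empty. It collects capture group 2 of each match into a String[].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefCsvSketch {
    // Same expression as above, escaped for a Java string literal.
    private static final Pattern FIELD = Pattern.compile(
            "(?:^|,\\s{0,})([\"]?)\\s{0,}((?:.|\\n|\\r)*?)\\1(?=[,]\\s{0,}|$)");

    // Hypothetical helper: collects capture group 2 (the value without its quotes).
    static String[] split(String record) {
        List<String> values = new ArrayList<>();
        Matcher m = FIELD.matcher(record);
        while (m.find()) {
            values.add(m.group(2));
        }
        return values.toArray(new String[0]);
    }
}
```

Calling BackrefCsvSketch.split on a record such as Alpha,"23-5-2013,TOM",TOTO should give three values, with the quoted comma kept inside the second one. Escaped quotes ("") inside a value are not unescaped by this pattern.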
  • 2020-12-20 10:11

    See this related answer for a decent Java-compatible regex for parsing CSV.

    It recognizes:

    • Newlines (after values or inside quoted values)
    • Quoted values containing escaped double-quotes like ""this""

    In short, you will use this pattern: (?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))

    Then collect each Matcher group(1) in a find() loop.
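
A rough sketch of that find() loop follows. The class name RegexCsvSketch and the helper fields are hypothetical, and stripping the outer quotes and unescaping "" is my own post-processing step, since group(1) still contains them. Note that the result is a flat list: the pattern treats a newline between rows as just another delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCsvSketch {
    // The pattern quoted above, escaped for a Java string literal.
    private static final Pattern FIELD = Pattern.compile(
            "(?:,|\\n|^)(\"(?:(?:\"\")*[^\"]*)*\"|[^\",\\n]*|(?:\\n|$))");

    // Hypothetical helper: returns every field found in the input, in order.
    static List<String> fields(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            String f = m.group(1);
            // A quoted field is captured with its quotes: strip them and unescape "".
            if (f.length() >= 2 && f.startsWith("\"") && f.endsWith("\"")) {
                f = f.substring(1, f.length() - 1).replace("\"\"", "\"");
            }
            out.add(f);
        }
        return out;
    }
}
```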


    Note: Although I have posted this answer about a "decent" regex I discovered, just to save people searching for one, it is by no means robust. I still agree with this answer by user "fgv": a CSV parser is preferable.

  • 2020-12-20 10:13

    Parsing CSV is a whole lot harder than one would imagine at first sight, which is why your best option is to use a well-designed and tested library to do the work for you. Two such libraries are opencsv and Super CSV, and there are many others. Have a look at both and use the one that best fits your requirements and style.
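
To give a feel for the cases such a library has to track, here is a small hand-rolled state machine. It is only an illustration under my own assumptions, not how opencsv or Super CSV are implemented, and the names CsvStateMachineSketch and parse are made up. It handles quoted commas, quoted newlines and doubled quotes, and nothing beyond that.

```java
import java.util.ArrayList;
import java.util.List;

class CsvStateMachineSketch {
    // Hypothetical helper: parses CSV text into records, each a String[] of fields.
    static List<String[]> parse(String text) {
        List<String[]> records = new ArrayList<>();
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // a doubled quote is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // commas and newlines are literal here
                }
            } else if (c == '"') {
                inQuotes = true;           // opening quote
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                // An unquoted newline ends both the field and the record.
                fields.add(field.toString());
                field.setLength(0);
                records.add(fields.toArray(new String[0]));
                fields.clear();
            } else if (c != '\r') {
                field.append(c);
            }
        }
        // Flush the last record when the input does not end with a newline.
        if (field.length() > 0 || !fields.isEmpty()) {
            fields.add(field.toString());
            records.add(fields.toArray(new String[0]));
        }
        return records;
    }
}
```

Each String[] in the returned list is one line of the input, which is the shape the question asks for. Real libraries also deal with configurable separators, encodings, malformed input and streaming, which is why they remain the better choice.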
