I have a string with some special characters. The aim is to retrieve a String[] holding the comma-separated fields of each line. Fields can be wrapped in the special character " (double quote), and inside a quoted field you can have \n and ,
Try this:
String source = "Alpha,Beta,Gama,\"23-5-2013,TOM\",TOTO,\"Julie, KameL\n"
+ "Titi\",God,\" timmy, tomy,tony,\n"
+ "tini\".";
Pattern p = Pattern.compile("(([^\"][^,]*)|\"([^\"]*)\"),?");
Matcher m = p.matcher(source);
while (m.find()) {
    if (m.group(2) != null) {
        // unquoted field
        System.out.println(m.group(2).replace("\n", ""));
    } else if (m.group(3) != null) {
        // quoted field, surrounding quotes already stripped
        System.out.println(m.group(3).replace("\n", ""));
    }
}
If the pattern matches a field without quotes, the result is returned in group 2. Fields with quotes are returned in group 3 (without the quotes). Hence I needed the distinction in the while block. You might find a prettier way.
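Since the question asks for a String[], here is a minimal sketch that wraps the same pattern in a method and collects the fields instead of printing them. The class name QuotedCsvSplit and the method name split are made up for illustration. It relies on the fact that every match fills either group 2 or group 3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
public class QuotedCsvSplit {
    // Same pattern as above: group 2 = unquoted field, group 3 = quoted field.
    private static final Pattern FIELD =
            Pattern.compile("(([^\"][^,]*)|\"([^\"]*)\"),?");

    public static String[] split(String source) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(source);
        while (m.find()) {
            // Exactly one of the two alternatives took part in the match.
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            // Drop embedded newlines, as the original loop does.
            fields.add(value.replace("\n", ""));
        }
        return fields.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String field : split("Alpha,\"23-5-2013,TOM\",TOTO")) {
            System.out.println(field);
        }
    }
}
```

Like the original loop, this is only a sketch for input shaped like the example. It does not handle escaped quotes inside quoted fields.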
Output:
Alpha
Beta
Gama
23-5-2013,TOM
TOTO
Julie, KameLTiti
God
timmy, tomy,tony,tini
.
Consider the following PowerShell example of a universal regex, tested on a Java parser, which requires no extra processing to reassemble the data parts. The first capture group matches an optional quote, which is then back-referenced at the end of the match, so you're assured to capture the entire value between, but not including, the quotes. Commas are not captured either, unless they are embedded in a quote-delimited substring.
(?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)
$Matches = @()
$String = 'Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
Titi",God,"timmy, \n
tomy,tony,tini"'
$Regex = '(?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)'
Write-Host start with
write-host $String
Write-Host
Write-Host found
([regex]"(?i)(?m)$Regex").matches($String) | foreach {
write-host "key at $($_.Groups[1].Index) = '$($_.Groups[1].Value)'`t= value at $($_.Groups[2].Index) = '$($_.Groups[2].Value)'"
} # next match
start with
Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
Titi",God,"timmy, \n
tomy,tony,tini"
found
key at 0 = '' = value at 0 = 'Alpha'
key at 6 = '' = value at 6 = 'Beta'
key at 11 = '' = value at 11 = 'Gama'
key at 16 = '"' = value at 17 = '23-5-2013,TOM'
key at 32 = '' = value at 32 = 'TOTO'
key at 37 = '"' = value at 38 = 'Julie, KameL\n
Titi'
key at 60 = '' = value at 60 = 'God'
key at 64 = '"' = value at 65 = 'timmy, \n
tomy,tony,tini'
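
Because the question is about Java, here is a minimal sketch of the same pattern used with java.util.regex. The class name BackrefCsvFields and the method name fields are made up for illustration. The (?i)(?m) flags from the PowerShell call are inlined at the front of the pattern, and group 2 is read as the field value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper, not part of the original answer.
public class BackrefCsvFields {
    // Group 1 = optional opening quote, group 2 = the value itself.
    private static final Pattern FIELD = Pattern.compile(
            "(?i)(?m)(?:^|,\\s{0,})([\"]?)\\s{0,}((?:.|\\n|\\r)*?)\\1(?=[,]\\s{0,}|$)");

    public static List<String> fields(String line) {
        List<String> values = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // The quotes are consumed by group 1 and the backreference,
            // so group 2 needs no trimming or unquoting.
            values.add(m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        // Second field is quoted and contains a comma and a real newline.
        System.out.println(fields("Alpha,\"a,\nb\",Gama"));
    }
}
```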

(?: start non-capture group
^ require start of string
| or
,\s{0,} a comma followed by any amount of whitespace
) close the non-capture group
( start capture group 1
["]? consume a quote if it exists; I like doing it this way in case you want to include characters other than a quote
) close capture group 1
\s{0,} consume any spaces if they exist; this means you don't need to trim the value later
( start capture group 2
(?:.|\n|\r)*? capture all characters including newlines, non-greedy
) close capture group 2
\1 if there was a quote it is stored in group 1, so if there was one, require it here
(?= start zero-width lookahead assertion
[,]\s{0,} must have a comma followed by optional whitespace
| or
$ end of the string
) close the zero-width lookahead assertion

See this related answer for a decent Java-compatible regex for parsing CSV.
It recognizes:
escaped quotes like ""this""
In short, you will use this pattern: (?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))
Then collect each Matcher group(1) in a find() loop.
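A minimal sketch of that find() loop follows. The class name CsvPatternLoop and the helper unquote are made up for illustration. Note that group(1) still includes the surrounding quotes for a quoted value, so the sketch strips them and collapses doubled quotes. That unquoting step is an addition of this sketch, not part of the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern quoted above.
public class CsvPatternLoop {
    private static final Pattern CELL = Pattern.compile(
            "(?:,|\\n|^)(\"(?:(?:\"\")*[^\"]*)*\"|[^\",\\n]*|(?:\\n|$))");

    public static List<String> cells(String csv) {
        List<String> out = new ArrayList<>();
        Matcher m = CELL.matcher(csv);
        while (m.find()) {
            out.add(unquote(m.group(1)));
        }
        return out;
    }

    // Assumption of this sketch: strip the outer quotes and turn "" into ".
    static String unquote(String raw) {
        if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
            return raw.substring(1, raw.length() - 1).replace("\"\"", "\"");
        }
        return raw;
    }

    public static void main(String[] args) {
        System.out.println(cells("a,\"b,c\",d"));
    }
}
```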
Note: Although I have posted this answer about a "decent" regex I discovered, just to save people searching for one, it is by no means robust. I still agree with this answer by user "fgv": a CSV parser is preferable.
Parsing CSV is a whole lot harder than one would imagine at first sight, and that's why your best option is to use a well-designed and tested library to do that work for you. Two such libraries are opencsv and supercsv, and there are many others. Have a look at both and use the one that best fits your requirements and style.