I thought this will be trivial but I can\'t get this to work.
Assume a line in a CSV file:
\"Barack Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC
That would be the expected behavior as quotes are just another string character in C#. Looks like what you are after is the quoted tokens or numeric tokens.
I think you might need to use Regex to split the strings unless some one else knows a better way.
Or you could just loop through the string one character at a time building up the string as you go and build the tokens that way. It's old school but may be the most reliable way in your case.
You might have to write your own split function.
"
character, toggle a booleanHere's an example:
public static class StringExtensions
{
public static string[] SplitQuoted(this string input, char separator, char quotechar)
{
List<string> tokens = new List<string>();
StringBuilder sb = new StringBuilder();
bool escaped = false;
foreach (char c in input)
{
if (c.Equals(separator) && !escaped)
{
// we have a token
tokens.Add(sb.ToString().Trim());
sb.Clear();
}
else if (c.Equals(separator) && escaped)
{
// ignore but add to string
sb.Append(c);
}
else if (c.Equals(quotechar))
{
escaped = !escaped;
sb.Append(c);
}
else
{
sb.Append(c);
}
}
tokens.Add(sb.ToString().Trim());
return tokens.ToArray();
}
}
Then just call:
string[] tokens = line.SplitQuoted(',','\"');
Results of benchmarking my code and Dan Tao's code are below. I'm happy to benchmark any other solutions if people want them?
Code:
string input = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\""; // Console.ReadLine()
string[] tokens = null;
// run tests
DateTime start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
tokens = input.SplitWithQualifier(',', '\"', false);
Console.WriteLine("1,000,000 x SplitWithQualifier = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);
start = DateTime.Now;
for (int i = 0; i<1000000;i++)
tokens = input.SplitQuoted(',', '\"');
Console.WriteLine("1,000,000 x SplitQuoted = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);
Output:
1,000,000 x SplitWithQualifier = 8156.25ms
1,000,000 x SplitQuoted = 2406.25ms
I have a SplitWithQualifier
extension method that I use here and there, which utilizes Regex
.
I make no claim as to the robustness of this code, but it has worked all right for me for a while.
// mangled code horribly to fit without scrolling
public static class CsvSplitter
{
public static string[] SplitWithQualifier(this string text,
char delimiter,
char qualifier,
bool stripQualifierFromResult)
{
string pattern = string.Format(
@"{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
Regex.Escape(delimiter.ToString()),
Regex.Escape(qualifier.ToString())
);
string[] split = Regex.Split(text, pattern);
if (stripQualifierFromResult)
return split.Select(s => s.Trim().Trim(qualifier)).ToArray();
else
return split;
}
}
Usage:
string csv = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\"";
string[] values = csv.SplitWithQualifier(',', '\"', true);
foreach (string value in values)
Console.WriteLine(value);
Output:
Barak Obama
48
President
1600 Penn Ave, Washington DC
I would recommend using a regular expression instead. It will allow you to extract more complicated substrings in a much more versatile manner (precisely as you want).
http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx
http://oreilly.com/windows/archive/csharp-regular-expressions.html
I see from the bigger picture that you are actually trying to parse CSV input. So instead of advising on how to split the string properly, I would instead recommend you to use a CSV parser to do this kind of thing.
One that I would recommend is the library (source code available) that you can get from this CodeProject page: http://www.codeproject.com/KB/database/CsvReader.aspx
I personally use it myself and like it. It's a .NET native code and a lot faster than using OLEDB (which also can do the CSV parsing for you, but believe me, it's slow).
You can't parse a CSV line with a simple Split on commas, because some cell contents will contain commas that aren't meant to delineate data but are actually part of the cell contents themselves.
Here is a link to a simple regex-based C# method that will convert your CSV into a handly DataTable
:
http://www.hotblue.com/article0000.aspx?a=0006
Working with DataTables is very easy - let me know if you need a code sample for that.