Split using delimiter except when delimiter is escaped

Submitted by 吃可爱长大的小学妹 on 2019-11-28 12:24:09

First off, I've dealt with data from Excel before. What you typically see is comma-separated values: if a value is considered a string, it is wrapped in double quotes (and can contain commas and double quotes); if it is considered numeric, there are no double quotes. Additionally, a double quote inside the data is escaped by doubling it, like "". Assuming all of that, here's how I've dealt with this in the past:

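using System.Collections.Generic;

public static class ExcelRowExtensions
{
    public static IEnumerable<string> SplitExcelRow(this string value)
    {
        // Hide escaped quotes ("") behind a placeholder so they don't toggle the quoted state.
        value = value.Replace("\"\"", "&quot;");
        bool quoted = false;
        int currStartIndex = 0;
        for (int i = 0; i < value.Length; i++)
        {
            char currChar = value[i];
            if (currChar == '"')
            {
                // Entering or leaving a quoted field.
                quoted = !quoted;
            }
            else if (currChar == ',' && !quoted)
            {
                // A comma outside quotes ends the current field.
                yield return value.Substring(currStartIndex, i - currStartIndex)
                    .Trim()
                    .Replace("\"", "")
                    .Replace("&quot;", "\"");
                currStartIndex = i + 1;
            }
        }
        // Whatever follows the last delimiter is the final field.
        yield return value.Substring(currStartIndex, value.Length - currStartIndex)
            .Trim()
            .Replace("\"", "")
            .Replace("&quot;", "\"");
    }
}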

Of course, this assumes the incoming data is valid, so input like "fo,o"b,ar","bar""foo" will not work. Additionally, if your data contains the literal text &quot;, it will be turned into a ", which may or may not be desirable.
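If the &quot; placeholder is a concern, one possible variation is to build each field character by character and treat a doubled quote inside a quoted field as a literal quote. This is only a sketch under the same assumptions about Excel's format as above; the names CsvSketch and SplitRow are made up for illustration. It also avoids a side effect of the placeholder approach, where an empty quoted field ("") comes back as a single ".

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvSketch
{
    // Hypothetical helper (not part of the answer above): splits one row on
    // commas that are outside double quotes, without using a placeholder.
    public static List<string> SplitRow(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool quoted = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"')
            {
                if (quoted && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // "" inside quotes is a literal quote
                    i++;                 // skip the second quote of the pair
                }
                else
                {
                    quoted = !quoted;    // opening or closing quote, not kept
                }
            }
            else if (c == ',' && !quoted)
            {
                fields.Add(current.ToString().Trim());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString().Trim()); // last field has no trailing comma
        return fields;
    }
}
```

Like the original, this trims every field (including quoted ones) and makes no attempt to recover from malformed input.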

There are a lot of ways to do this. One inelegant way that would work is:

  1. Convert \",\" to tab or some other delimiter (I assume you left out a few \" in your example, because otherwise the string is not consistent)
  2. Strip all remaining commas
  3. Strip all remaining \"
  4. Convert your delimiter (e.g. tab) back into a comma

Now you have what you wanted in the first place.
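The four steps above could be sketched roughly as follows. This assumes every field is wrapped in plain double quotes and that the data never contains a tab; the class and method names are invented for the example.

```csharp
public static class DelimiterSwapSketch
{
    // Hypothetical helper: applies the four steps in order.
    public static string Normalize(string row)
    {
        return row
            .Replace("\",\"", "\t") // 1. field boundaries "," become a tab
            .Replace(",", "")       // 2. strip remaining commas (thousands separators)
            .Replace("\"", "")      // 3. strip remaining quotes
            .Replace("\t", ",");    // 4. turn the temporary delimiter back into a comma
    }
}
```

For an input such as "1,234,123.00","2,345.00","12,345.00" this yields 1234123.00,2345.00,12345.00. An unquoted field in the middle of the row would have its delimiting commas stripped in step 2, which is why the input needs to be consistent.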

I agree with Kyle regarding your string probably not being consistent.

Instead of Kyle's first step you could use

string[] vals = Regex.Split(value, @"\s*\"",\s*");

From your input example, we can see that there are three "unwanted" sequences of characters:

\"
\",
,\"

So, add all these sequences to the separator array passed to the Split method:

string[] result = clipData.Split(new[] { @",\""", @"\"",", @"\""" }, 
    StringSplitOptions.None);

This will give you an array containing a few empty elements. If that is a problem, use StringSplitOptions.RemoveEmptyEntries instead of StringSplitOptions.None:

string[] result = clipData.Split(new[] { @",\""", @"\"",", @"\""" }, 
    StringSplitOptions.RemoveEmptyEntries);
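As a rough, self-contained illustration of that call, the sketch below wraps it in a helper and trims each cell. The class name, method name and the trimming step are additions for the example, and it assumes the clipboard text really contains a literal backslash before each quote.

```csharp
using System;
using System.Linq;

public static class SeparatorSplitSketch
{
    // Hypothetical wrapper around the Split call shown above.
    public static string[] SplitCells(string clipData)
    {
        return clipData
            .Split(new[] { @",\""", @"\"",", @"\""" },
                StringSplitOptions.RemoveEmptyEntries)
            .Select(cell => cell.Trim()) // drop the padding around each value
            .ToArray();
    }
}
```

Called with the text \" 1,234,123.00 \",\" 2,345.00 \", 342.00 ,\" 12,345.00 \" it should return the four values 1,234,123.00, 2,345.00, 342.00 and 12,345.00. The commas inside the numbers survive because none of them is adjacent to a \" sequence.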

You could try to use a bit of LINQ:

string excelData = "\\\" 1,234,123.00 \\\",\\\" 2,345.00 \\\", 342.00 ,\\\" 12,345.00 \\\"";

IEnumerable<string> cells = from x in excelData.Split(new string[] { "\\\"" }, StringSplitOptions.RemoveEmptyEntries)
                            let y = x.Trim(',').Trim()
                            where !string.IsNullOrWhiteSpace(y)
                            select y;

Alternatively, if you don't like this suggestion, try implementing a similar pattern with a regular expression.
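One way such a regular expression might look is sketched here. The pattern, the group names and the helper class are assumptions made for illustration, not something given in the answer: it matches either a cell wrapped in \" ... \" or a run of characters containing no comma.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexCellsSketch
{
    // Either a \"...\" cell (lazy, so it stops at the next \") or a bare,
    // comma-free run. In the verbatim string, \\ is a regex-escaped
    // backslash and "" is a single quote character.
    private static readonly Regex CellPattern =
        new Regex(@"\\""(?<quoted>.*?)\\""|(?<bare>[^,]+)");

    public static List<string> ExtractCells(string excelData)
    {
        return CellPattern.Matches(excelData)
            .Cast<Match>() // MatchCollection is non-generic on older frameworks
            .Select(m => m.Groups["quoted"].Success
                ? m.Groups["quoted"].Value
                : m.Groups["bare"].Value)
            .Select(v => v.Trim())
            .Where(v => v.Length > 0)
            .ToList();
    }
}
```

Passing the excelData string from the LINQ example should give the same four cells. Unlike the Split-based version, a stray \" with no closing partner would fall through to the bare alternative, so malformed input may produce odd fragments.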
