Regex with non-capturing group in C#

走远了吗. 提交于 2019-12-06 18:26:37

问题


I am using the following Regex

JOINTS.*\s*(?:(\d*\s*\S*\s*\S*\s*\S*)\r\n\s*)*

on the following type of data:

 JOINTS               DISPL.-X               DISPL.-Y               ROTATION


     1            0.000000E+00           0.975415E+01           0.616921E+01
     2            0.000000E+00           0.000000E+00           0.000000E+00

The idea is to extract two groups, each containing a line (starting with the Joint Number, 1, 2, etc.) The C# code is as follows:

string jointPattern = @"JOINTS.*\s*(?:(\d*\s*\S*\s*\S*\s*\S*)\r\n\s*)*";
MatchCollection mc = Regex.Matches(outFileSection, jointPattern );
foreach (Capture c in mc[0].Captures)
{
    JointOutput j = new JointOutput();
    string[] vals = c.Value.Split();
    j.Joint = int.Parse(vals[0]) - 1;
    j.XDisplacement = float.Parse(vals[1]);
    j.YDisplacement = float.Parse(vals[2]);
    j.Rotation = float.Parse(vals[3]);
    joints.Add(j);
}

However, this does not work: rather than returning two captured groups (the inside group), it returns one group: the entire block, including the column headers. Why does this happen? Does C# deal with un-captured groups differently?

Finally, are RegExes the best way to do this? (I really do feel like I have two problems now.)


回答1:


mc[0].Captures is equivalent to mc[0].Groups[0].Captures. Groups[0] always refers to the whole match, so there will only ever be the one Capture associated with it. The part you're looking for is captured in group #1, so you should be using mc[0].Groups[1].Captures.

But your regex is designed to match the whole input in one attempt, so the Matches() method will always return a MatchCollection with only one Match in it (assuming the match is successful). You might as well use Match() instead:

  Match m = Regex.Match(source, jointPattern);
  if (m.Success)
  {
    foreach (Capture c in m.Groups[1].Captures)
    {
      Console.WriteLine(c.Value);
    }
  }

output:

1            0.000000E+00           0.975415E+01           0.616921E+01
2            0.000000E+00           0.000000E+00           0.000000E+00



回答2:


I would just not use Regex for heavy lifting and parse the text.

var data = @"     JOINTS               DISPL.-X               DISPL.-Y               ROTATION


         1            0.000000E+00           0.975415E+01           0.616921E+01
         2            0.000000E+00           0.000000E+00           0.000000E+00";

var lines = data.Split('\r', '\n').Where(s => !string.IsNullOrWhiteSpace(s));
var regex = new Regex(@"(\S+)");

var dataItems = lines.Select(s => regex.Matches(s)).Select(m => m.Cast<Match>().Select(c => c.Value));




回答3:


Why not just capture the values and ignore the rest. Here is a regex which gets the values.

string data = @"JOINTS DISPL.-X DISPL.-Y ROTATION
 1 0.000000E+00 0.975415E+01 0.616921E+01
 2 0.000000E+00 0.000000E+00 0.000000E+00";

string pattern = @"^
\s+
 (?<Joint>\d+)
\s+
 (?<ValX>[^\s]+)
\s+
 (?<ValY>[^\s]+)
\s+
 (?<Rotation>[^\s]+)";

var result = Regex.Matches(data, pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
                  .OfType<Match>()
                  .Select (mt => new
                  {
                    Joint = mt.Groups["Joint"].Value,
                    ValX  = mt.Groups["ValX"].Value,
                    ValY  = mt.Groups["ValY"].Value,
                    Rotation = mt.Groups["Rotation"].Value,
                  });
/* result is
IEnumerable<> (2 items)
Joint ValX ValY Rotation
1 0.000000E+00 0.975415E+01 0.616921E+01
2 0.000000E+00 0.000000E+00 0.000000E+00
*/



回答4:


There's two problems: The repeating part (?:...) is not matching properly; and the .* is greedy and consumes all the input, so the repeating part never matches even if it could.

Use this instead:

JOINTS.*?[\r\n]+(?:\s*(\d+\s*\S*\s*\S*\s*\S*)[\r\n\s]*)*

This has a non-greedy leading part, ensures that the line-matching part starts on a new line (not in the middle of a title), and uses [\r\n\s]* in case the newlines are not exactly as you expect.

Personally, I would use regexes for this, but I like regexes :-) If you happen to know that the structure of the string will always be [title][newline][newline][lines] then perhaps it's more straightforward (if less flexible) to just split on newlines and process accordingly.

Finally, you can use regex101.com or one of the many other regex testing sites to help debug your regular expressions.



来源:https://stackoverflow.com/questions/15170279/regex-with-non-capturing-group-in-c-sharp

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!