Heading identification with Regex

Question

I'm wondering how I can identify headings that use differing numerical marking styles with one or more regular expressions, given that styles sometimes overlap between documents. The goal is to extract all the subheadings and data for a specific heading in each file, but these files aren't standardized. Are regular expressions even the right approach here?

I'm working on a program that parses a .pdf file and looks for a specific section. Once it finds the section, it finds all subsections of that section and their content and stores them in a Dictionary<string, string>. I start by reading the entire PDF into a string, and then use this function to locate the "marking" section.

private string GetMarkingSection(string text)
    {
      int startIndex = 0;
      int endIndex = text.Length; // if "marking" is the last heading, read to the end of the text
      bool startIndexFound = false;
      Regex rx = new Regex(HEADINGREGEX);
      foreach (Match match in rx.Matches(text))
      {
        if (startIndexFound)
        {
          endIndex = match.Index; // the next heading ends the section
          break;
        }
        if (match.Value.ToLower().Contains("marking"))
        {
          startIndex = match.Index;
          startIndexFound = true;
        }
      }
      // no "marking" heading at all: return an empty string
      return startIndexFound ? text.Substring(startIndex, endIndex - startIndex) : "";
    }

Once the marking section is found, I use this to find subsections.

private Dictionary<string, string> GetSubsections(string text)
    {
      Dictionary<string, string> subsections = new Dictionary<string, string>();
      string[] unprocessedSubSecs = Regex.Split(text, SUBSECTIONREGEX);
      string title = "";
      string content = "";
      foreach(string s in unprocessedSubSecs)
      {
        if(s != "") //sometimes it pulls in empty strings
        {
          Match m = Regex.Match(s, SUBSECTIONREGEX);
          if (m.Success)
          {
            title = s;
          }
          else
          {
            content = s;
            if (!String.IsNullOrWhiteSpace(content) && !String.IsNullOrWhiteSpace(title))
            {
              subsections.Add(title, content);
            }
          }
        }
      }
      return subsections;
    }

Getting these methods to work the way I want them to isn't an issue; the problem is getting them to work with each of the documents. I'm working on a commercial application, so any API that requires a license isn't going to work for me. These documents are anywhere from 1 to 16 years old, so the formatting varies quite a bit. Here is a link to some sample headings and subheadings from various documents. But to make it easy, here are the regex patterns I'm using:

Heading: (?m)^(\d+\.\d+\s[ \w,\-]+)\r?$
Subheading: (?m)^(\d\.[\d.]+ ?[ \w]+) ?\r?$
Master Key: (?m)^(\d\.?[\d.]*? ?[ \-,:\w]+) ?\r?$

Since some headings use the subheading format in other documents, I am unable to use the same heading regex for each file, and the same goes for my subheading regex.

My alternative was to write a master key (listed in the regex link) that identifies all types of headings, then take the last numeric component of each heading (the X in 5.1.X) and look for 5.1.X+1 to find the end of that section.

That's when I ran into another problem. Some of these files have absolutely no proper structure. Most of them go from 5.2 -> 7.1.5 (5.2 -> 5.3 or 6.0 would be expected).

I'm trying to wrap my head around a solution for something like this, but I've got nothing... I am open to ideas not involving regex as well.

Here is my updated GetMarkingSection method:

private Dictionary<string, string> GetMarkingSection(string text)
    {
      var headingRegex = HEADING1REGEX;
      var subheadingRegex = HEADING2REGEX;
      Dictionary<string, string> markingSection = new Dictionary<string, string>();

      if (Regex.Matches(text, HEADING1REGEX, RegexOptions.Multiline | RegexOptions.Singleline).Count > 0)
      {
        foreach (Match m in Regex.Matches(text, headingRegex, RegexOptions.Multiline | RegexOptions.Singleline))
        {
          if (Regex.IsMatch(m.ToString(), HEADINGMASTERKEY))
          {
            if (m.Groups[2].Value.ToLower().Contains("marking"))
            {
              var subheadings = Regex.Matches(m.ToString(), subheadingRegex, RegexOptions.Multiline | RegexOptions.Singleline);
              foreach (Match s in subheadings)
              {
                markingSection.Add(s.Groups[1].Value + " " + s.Groups[2].Value, s.Groups[3].Value);
              }
              return markingSection;
            }
          }
        }
      }
      else
      {
        headingRegex = HEADING2REGEX;
        subheadingRegex = HEADING3REGEX;

        foreach(Match m in Regex.Matches(text, headingRegex, RegexOptions.Multiline | RegexOptions.Singleline))
        {
          if(Regex.IsMatch(m.ToString(), HEADINGMASTERKEY))
          {
            if (m.Groups[2].Value.ToLower().Contains("marking"))
            {
              var subheadings = Regex.Matches(m.ToString(), subheadingRegex, RegexOptions.Multiline | RegexOptions.Singleline);
              foreach (Match s in subheadings)
              {
                markingSection.Add(s.Groups[1].Value + " " + s.Groups[2].Value, s.Groups[3].Value);
              }
              return markingSection;
            }
          }
        }
      }
      return null;
    }

Here are some example PDF files:

Answer 1:

See if this approach works:

var heading1Regex = @"^(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\s|\Z)";

var heading2Regex = @"^(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\s|\Z)";

var heading3Regex = @"^(\d+)\.(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\.\d+\s|\Z)";
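
To make the behavior of these patterns concrete, here is a minimal sketch that applies heading2Regex to a small made-up sample. It assumes the patterns are run with RegexOptions.Multiline | RegexOptions.Singleline (so that ^ and $ anchor at line boundaries and . can cross newlines inside the content group) and that line endings are normalized to \n first. The class name HeadingDemo, the method FindSections, and the sample text are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeadingDemo
{
    // Pattern copied from the answer; the options below are an assumption.
    private const string Heading2Regex =
        @"^(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\s|\Z)";

    private const RegexOptions Options =
        RegexOptions.Multiline | RegexOptions.Singleline;

    // Returns (number, title, content) for every "N.N Title" section in the text.
    public static List<(string Number, string Title, string Content)> FindSections(string text)
    {
        // "$\n" in the pattern expects a bare \n, so normalize Windows line endings.
        text = text.Replace("\r\n", "\n");

        var sections = new List<(string Number, string Title, string Content)>();
        foreach (Match m in Regex.Matches(text, Heading2Regex, Options))
        {
            // Unnamed groups 1 and 2 are the two numbers; title and content are named.
            string number = m.Groups[1].Value + "." + m.Groups[2].Value;
            sections.Add((number, m.Groups["title"].Value, m.Groups["content"].Value));
        }
        return sections;
    }

    public static void Main()
    {
        // Hypothetical sample text standing in for the extracted PDF string.
        string sample =
            "5.1 Marking\nLabels must be legible.\n5.2 Packaging\nUse sealed bags.\n";

        foreach (var s in FindSections(sample))
        {
            Console.WriteLine($"{s.Number} | {s.Title} | {s.Content}");
        }
    }
}
```

Note that the lookahead at the end of each pattern is what stops the lazy content group: it ends at the next heading of the same level or at the end of the text.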

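For each PDF file:

// text: the full PDF text as a string
// Multiline makes ^ and $ anchor at line boundaries; Singleline lets . span newlines in the content group
var options = RegexOptions.Multiline | RegexOptions.Singleline;
var headingRegex = heading1Regex;
var subHeadingRegex = heading2Regex;

if (!Regex.IsMatch(text, headingRegex, options))
{
    // no top-level "N Title" headings in this file: shift down one level
    headingRegex = heading2Regex;
    subHeadingRegex = heading3Regex;
}

foreach (Match heading in Regex.Matches(text, headingRegex, options))
{
    // the content group holds the section body; look for subheadings inside it
    var subHeadings = Regex.Matches(heading.Groups["content"].Value, subHeadingRegex, options);
}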
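
Put together, the selection logic above might look like the following sketch, which returns the subsections of the first heading whose title contains a keyword such as "marking". It is an illustration under assumptions, not a tested solution for the real documents: the class MarkingExtractor, the method GetSubsections, the keyword parameter, and the sample text are all made up here, and line endings are assumed to be normalized to \n.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MarkingExtractor
{
    private const RegexOptions Options =
        RegexOptions.Multiline | RegexOptions.Singleline;

    // The three patterns from the answer, ordered from the top level downwards.
    private static readonly string[] LevelRegexes =
    {
        @"^(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\s|\Z)",
        @"^(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\s|\Z)",
        @"^(\d+)\.(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\.\d+\s|\Z)",
    };

    // Returns "heading line" -> content for the subsections of the first heading
    // whose title contains the keyword, or an empty dictionary if none matches.
    public static Dictionary<string, string> GetSubsections(string text, string keyword)
    {
        text = text.Replace("\r\n", "\n");
        var result = new Dictionary<string, string>();

        // Use level 1 for headings if the document has any; otherwise drop to level 2.
        int level = Regex.IsMatch(text, LevelRegexes[0], Options) ? 0 : 1;

        foreach (Match heading in Regex.Matches(text, LevelRegexes[level], Options))
        {
            string title = heading.Groups["title"].Value;
            if (title.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) < 0)
                continue;

            string body = heading.Groups["content"].Value;
            foreach (Match sub in Regex.Matches(body, LevelRegexes[level + 1], Options))
            {
                // The first line of the match is the full heading, e.g. "5.1.1 Labels".
                string headingLine = sub.Value.Substring(0, sub.Value.IndexOf('\n')).Trim();
                // Indexer assignment avoids the exception Add throws on duplicate keys.
                result[headingLine] = sub.Groups["content"].Value.Trim();
            }
            break;
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample with no level-1 headings, so level 2 is used.
        string sample =
            "5.1 Marking\n5.1.1 Labels\nLabels must be legible.\n" +
            "5.1.2 Ink\nUse permanent ink.\n5.2 Packaging\nUse sealed bags.\n";

        foreach (var pair in GetSubsections(sample, "marking"))
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

Keeping the patterns in an array indexed by level means the fallback is a single index shift rather than a duplicated loop.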

Edge case 1: after 5.2 comes 7.1.3

Get the main section match using heading2Regex.

Convert group 1 of the match to an integer:

int.TryParse(match.Groups[1].Value, out var headingIndex);

Get the subsection matches using heading3Regex.

For each subsection match, convert group 1 to an integer:

int.TryParse(subMatch.Groups[1].Value, out var subHeadingIndex);

Check whether headingIndex equals subHeadingIndex. If it does not, the subsection is numbered under a different section (7.1.3 found under 5.2, for example), so handle it accordingly.
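
The comparison described in these steps could be sketched as below. The names NumberingCheck and Partition, the two output lists, and the sample body are assumptions made for illustration; what "handle accordingly" should mean for a stray subsection depends on the documents, so this sketch only separates the two groups.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberingCheck
{
    private const RegexOptions Options =
        RegexOptions.Multiline | RegexOptions.Singleline;

    private const string Heading3Regex =
        @"^(\d+)\.(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\.\d+\s|\Z)";

    // Sorts the level-3 subsections found in a section body into those whose
    // leading number equals headingIndex and those numbered under another section.
    public static void Partition(string sectionBody, int headingIndex,
        List<string> matching, List<string> strays)
    {
        sectionBody = sectionBody.Replace("\r\n", "\n");
        foreach (Match sub in Regex.Matches(sectionBody, Heading3Regex, Options))
        {
            string title = sub.Groups["title"].Value.Trim();
            // Group 1 is the first number of "N.N.N"; TryParse guards against overflow.
            bool parsed = int.TryParse(sub.Groups[1].Value, out int subHeadingIndex);
            if (parsed && subHeadingIndex == headingIndex)
                matching.Add(title);
            else
                strays.Add(title);
        }
    }

    public static void Main()
    {
        // Hypothetical body of a "5.2" section that jumps straight to 7.1.3.
        string body = "5.2.1 Labels\nLabels must be legible.\n7.1.3 Storage\nKeep dry.\n";
        var matching = new List<string>();
        var strays = new List<string>();

        Partition(body, 5, matching, strays);

        Console.WriteLine("Belongs to 5: " + string.Join(", ", matching));
        Console.WriteLine("Numbered elsewhere: " + string.Join(", ", strays));
    }
}
```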

Source: https://stackoverflow.com/questions/50768051/heading-identification-with-regex

Tags

regex

parsing

pdf