Find dynamic words through patterns in LINQ

Question

Here is how the HTML starts:

BUSINESS DOCUMENTATION

<p>Some company</p>
<p>
<p>DEPARTMENT: Legal Process</p>
<p>FUNCTION: Computer Department</p>
<p>PROCESS: Process Server</p>
<p>PROCEDURE: ABC Process Server</p>
<p>OWNER: Some User</p>
<p>REVISION DATE: 06/10/2013</p>
<p>
<p>OBJECTIVE: To ensure that the process server receive their invoices the following day.</p>
<p>
<p>WHEN TO PERFORM: Daily</p>
<p>
<p>WHO WILL PERFORM? Computer Team</p>
<p>
<p>TIME TO COMPLETE: 5 minutes</p>
<p>
<p>TECHNOLOGY REQUIREMENT(S): </p>
<p>
<p>SOURCE DOCUMENT(S): N/A</p>
<p>
<p>CODES AND DEFINITIONS: N/A</p>
<p>
<table border="1">
  <tr>
    <td>
      <p>KPI&rsquo;s: </p>
    </td>
  </tr>
</table>
<p>
<table border="1">
  <tr>
    <td>
      <p>RISKS:  </p>
    </td>
  </tr>
</table>

After this there is a whole bunch of text. I need to parse specific data out of the header shown above.

I need to parse out the Department, Function, Process, Procedure, Objective, When to Perform, Who Will Perform, Time to Complete, Technology Requirements, Source Documents, Codes and Definitions, and Risks.

I then need to delete this information from the Html column while leaving everything else intact. Is this possible in LINQ?

Here is the LINQ query I am using:

var result = (from d in IPACS_Documents
              join dp in IPACS_ProcedureDocs on d.DocumentID equals dp.DocumentID
              join p in IPACS_Procedures on dp.ProcedureID equals p.ProcedureID
              where d.DocumentID == 4
                    && d.DateDeleted == null
              select d.Html);

Console.WriteLine(result);

Answer 1:

This regex worked just fine for me on your input data

(DEPARTMENT|FUNCTION|OBJECTIVE):\s*(?<value>.+)\<

The result is multiple Matches with 2 groups each - the first is the key and the second is the value. I have only handled three cases (DEPARTMENT, FUNCTION and OBJECTIVE), but you can add the rest easily enough.

To remove the information thus parsed, you can do a Regex.Replace with this regex

(?<start>\<p\>(?<name>DEPARTMENT|FUNCTION|OBJECTIVE):\s*)(?<value>.+)(?<end>\</p\>)

and replacement string as

${start}${end}

leaving out value.

In code, this looks kinda like this (quickly typed this out in Notepad++ - may have minor errors).

private static readonly Regex ParseDocRegex = new Regex(@"(?<start>\<p\>(?<name>DEPARTMENT|FUNCTION|OBJECTIVE):\s*)(?<value>.+)(?<end>\</p\>)", RegexOptions.ExplicitCapture | RegexOptions.Compiled);

...

// AsEnumerable() switches to LINQ to Objects, because Regex calls cannot be translated to SQL
var parsed = from html in result.AsEnumerable()
    let matches = ParseDocRegex.Matches(html)
    where matches.Count > 0
    select new
    {
        namesAndValues = from m in matches.Cast<Match>()
                         select new KeyValuePair<string, string>(m.Groups["name"].Value, m.Groups["value"].Value),
        strippedHtml = ParseDocRegex.Replace(html, "${start}${end}")
    };

This ought to give you the desired output.
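For reference, here is a self-contained sketch of the same idea extended to all of the labels listed in the question. The class name DocHeaderParser and its Extract and Strip methods are made up for illustration. The pattern also assumes that each field sits in its own well-formed <p>...</p> on a single line and that a label ends in either ":" or "?" (as in "WHO WILL PERFORM?"), which holds for the sample but may not hold for every document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class DocHeaderParser
{
    // Labels copied from the question's sample HTML; extend as needed.
    private static readonly string[] Labels =
    {
        "DEPARTMENT", "FUNCTION", "PROCESS", "PROCEDURE", "OBJECTIVE",
        "WHEN TO PERFORM", "WHO WILL PERFORM", "TIME TO COMPLETE",
        "TECHNOLOGY REQUIREMENT(S)", "SOURCE DOCUMENT(S)",
        "CODES AND DEFINITIONS", "RISKS"
    };

    // Regex.Escape protects the parentheses in labels such as "SOURCE DOCUMENT(S)".
    // The lazy .*? stops at the first </p>, so an empty value is allowed.
    private static readonly Regex FieldRegex = new Regex(
        @"(?<start><p>(?<name>" + string.Join("|", Labels.Select(Regex.Escape)) +
        @")[:?]\s*)(?<value>.*?)(?<end></p>)",
        RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    // Returns label -> value for every field found in the HTML.
    public static Dictionary<string, string> Extract(string html)
    {
        var values = new Dictionary<string, string>();
        foreach (Match m in FieldRegex.Matches(html))
        {
            // If a label repeats, the last occurrence wins.
            values[m.Groups["name"].Value] = m.Groups["value"].Value.Trim();
        }
        return values;
    }

    // Removes only the values, keeping the label and the surrounding <p> tags.
    public static string Strip(string html)
    {
        return FieldRegex.Replace(html, "${start}${end}");
    }
}
```

Calling DocHeaderParser.Extract(html) on each string returned by the query gives the parsed fields, and DocHeaderParser.Strip(html) gives the text to write back to the Html column. To drop the whole paragraph rather than just the value, replace with an empty string instead of "${start}${end}".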

Answer 2:

It can be done with many LINQ statements, but with regular expressions you need only a few lines of code.

Answer 3:

For HTML, you need an HTML parser. Try HTML Agility Pack or CsQuery.

Regular expressions can handle simple matches against HTML, but they are not sufficient for hierarchical structures, and queries written with them are less precise.

Any HTML extraction is going to be fragile as the structure of the HTML changes. HTML is a presentation format and creators seldom care about machine interpretation. At least with a parser, you'll get an accurate model for the presentation markup (assuming it is valid HTML). You'll also get translation of entities into characters and the ability to extract all the descendant text of an element without internal markup elements like bold or italics.

You can use arbitrary assemblies in LINQPad simply by adding a reference, and for expression-based scripts, you can import designated namespaces automatically.
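As a rough sketch of the parser approach, the following uses HTML Agility Pack (the HtmlAgilityPack NuGet package must be referenced). The class name DocHeaderHapParser and the ExtractAndRemove method are hypothetical. It assumes every field lives in its own properly closed <p> element; the sample in the question has stray unclosed <p> tags, which a parser may nest differently than expected, so test it against the real data first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // external dependency: NuGet package "HtmlAgilityPack"

// Hypothetical helper, not part of the original answer.
public static class DocHeaderHapParser
{
    // Collects "LABEL: value" paragraphs whose label is in 'labels'
    // and removes those paragraphs from the document.
    public static Dictionary<string, string> ExtractAndRemove(
        HtmlDocument doc, ISet<string> labels)
    {
        var values = new Dictionary<string, string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null) return values;

        // Copy to a list so removing nodes cannot disturb the iteration.
        foreach (var p in paragraphs.ToList())
        {
            // InnerText drops inline markup such as <b>; DeEntitize decodes entities.
            string text = HtmlEntity.DeEntitize(p.InnerText).Trim();

            // A label ends at the first ':' or '?' (e.g. "WHO WILL PERFORM?").
            int sep = text.IndexOfAny(new[] { ':', '?' });
            if (sep < 0) continue;

            string name = text.Substring(0, sep).Trim();
            if (!labels.Contains(name)) continue;

            values[name] = text.Substring(sep + 1).Trim();
            p.Remove(); // delete the whole paragraph from the document
        }
        return values;
    }
}
```

To use it, create an HtmlDocument, call LoadHtml(html) on it, pass it to ExtractAndRemove together with a HashSet<string> of the labels you want, and then read doc.DocumentNode.OuterHtml to get the remaining markup for the Html column.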

Source: https://stackoverflow.com/questions/18387665/find-dynamic-words-through-patterns-in-linq

Tags

linq

linqpad