I have about 2500 HTML files written to different standards, and I need to remove the footer from each of them. The HTML below is the footer of one of my files; I need to remove the two hr elements marked below and everything between them.
So far I have only tried targeting the hr element with XPath (via HTML Agility Pack), using SelectSingleNode and DocumentNode.SelectNodes("//hr"), and then iterating over the result with a foreach.
But I am too new to XPath to use it properly, and I don't know how to select the node and its siblings(?) in order to delete them.
This is what I've got so far, with the help of this community. :)
private static void RemoveHR(IEnumerable<string> files)
{
    var document = new HtmlDocument();
    int removedCount = 0;                       // hr elements removed across all files
    List<string> errors = new List<string>();   // "file|message" entries for the log
    foreach (var file in files)
    {
        try
        {
            document.Load(file);
            // SelectNodes returns null (not an empty collection) when nothing matches
            var hrs = document.DocumentNode.SelectNodes("//hr");
            if (hrs == null)
            {
                errors.Add(file + "|no hr element found");
                continue;
            }
            foreach (var hrNode in hrs)
            {
                hrNode.Remove();
                removedCount++;
            }
            document.Save(file);
        }
        catch (Exception ex)
        {
            errors.Add(file + "|" + ex.Message);
        }
    }
    using (StreamWriter logger = File.CreateText(@"D:\websites\dev.openjournal.tld\public\arkivet\ErrorLogs\hr_error_log.txt"))
    {
        foreach (var error in errors)
        {
            logger.WriteLine(error);
        }
    }
    Console.WriteLine("Number of hr elements removed: {0}", removedCount);
    Console.WriteLine("Number of files with errors or no hr element: {0}", errors.Count);
}
HTML-source:
<hr color=#ff00ff SIZE=3> //start element
<p style="text-align : center; color : Red; font-weight : bold;">How to cite this paper:</i></p>
<p style="text-align : left; color : black;">Ekmekçioglu, F. Çuna, Lynch, Michael F. & Willett, Peter (1996) "Stemming and N-gram matching for term conflation in Turkish texts" <em>Information Research</em>, <strong>1</strong>(1) Available at: http://informationr.net/ir/2-2/paper13.html</p>
<p style="text-align : center">© the authors, 1996.</p>
<hr color="#ff00ff" size="1"><div align="center">Check for citations, <a href="http://scholar.google.co.uk/scholar?hl=en&q=http://informationr.net/ir/2-2/paper13.html&btnG=Search&as_sdt=2000">using Google Scholar</a></div>
<hr color="#ff00ff" size="1">
<table border="0" cellpadding="15" cellspacing="0" align="center">
<tr>
<td><a href="infres22.html"><h4>Contents</h4></a></td>
<td align="center" valign="top"><h5 align="center"><IMG SRC="http://counter.digits.net/wc/-d/-z/6/-b/FF0033/paper13" ALIGN=middle WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 VSPACE=2><br><a href="http://www.digits.net/ ">Web Counter</a><br>Counting only since 13 December 2002</h5></td>
<td><a href="http://InformationR.net/ir/"><h4>Home</h4></a></td>
</tr>
</table>
<hr color=#ff00ff SIZE=3> //end element
EDIT: I experimented a bit with preceding-sibling and following-sibling to target the nodes. Unfortunately, the result does not include the boundary nodes themselves.
var footerTags = document.DocumentNode.SelectNodes("//*[preceding-sibling::p[contains(text(),'How to cite this')] and following-sibling::hr[@color = '#ff00ff']]");
It finds the paragraph containing the text "How to cite this", then selects every node between it and the hr with color "#ff00ff". However, the paragraph and the hr themselves are not in the resulting list, and they need to be removed along with the nodes between them.
Assuming the start and end nodes are truly the same (same tag name, attributes, and attribute values) as you mentioned in the comments above, it's not too hard:
- Select the start node.
- Iterate over and remove each sibling up to and including the end node.
- Remove the start node.
Sample HTML:
var html =
@"<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<div>DO NOT DELETE</div>
<hr color=""#ff00ff"" SIZE='3'> //start element
<p style='text-align : center; color : Red; font-weight : bold;'>How to cite this paper:</i></p>
<p style='text-align : left; color : black;'>Ekmekçioglu, F. Çuna, Lynch, Michael F. & Willett, Peter (1996) ""Stemming and N-gram matching for term conflation in Turkish texts"" <em>Information Research</em>, <strong>1</strong>(1) Available at: http://informationr.net/ir/2-2/paper13.html</p>
<p style='text-align : center'>© the authors, 1996.</p>
<hr color='#ff00ff' size='1'><div align='center'>Check for citations, <a href='http://scholar.google.co.uk/scholar?hl=en&q=http://informationr.net/ir/2-2/paper13.html&btnG=Search&as_sdt=2000'>using Google Scholar</a></div>
<hr color='#ff00ff' size='1'>
<table border='0' cellpadding='15' cellspacing='0' align='center'>
<tr>
<td><a href='infres22.html'><h4>Contents</h4></a></td>
<td align='center' valign='top'><h5 align='center'><IMG SRC='http://counter.digits.net/wc/-d/-z/6/-b/FF0033/paper13' ALIGN=middle WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 VSPACE=2><br><a href='http://www.digits.net/'>Web Counter</a><br>Counting only since 13 December 2002</h5></td>
<td><a href='http://InformationR.net/ir/'><h4>Home</h4></a></td>
</tr>
</table>
<hr COLOR='#ff00ff' SIZE=""3""> //end element
<div>DO NOT DELETE</div>
</body>
</html>";
Parse it and remove the footer:
// requires: using HtmlAgilityPack; using System.Text.RegularExpressions;
var document = new HtmlDocument();
document.LoadHtml(html);
var startNode = document.DocumentNode.SelectSingleNode("//hr[@size='3'][@color='#ff00ff']");
// account for mismatched quotes in HTML source
var quotesRegex = new Regex("[\"']");
var startNodeNoQuotes = quotesRegex.Replace(startNode.OuterHtml, "");
HtmlNode siblingNode;
// remove each following sibling, stopping once the end node has been removed
while ((siblingNode = startNode.NextSibling) != null)
{
    siblingNode.Remove();
    if (quotesRegex.Replace(siblingNode.OuterHtml, "") == startNodeNoQuotes)
    {
        break; // end node
    }
}
startNode.Remove();
Resulting output:
<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<div>DO NOT DELETE</div>
//end element
<div>DO NOT DELETE</div>
</body>
</html>
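Because the files follow different standards, the loop above can also be sketched as a reusable helper that guards against files without a footer. This is only a sketch under stated assumptions: the class name FooterRemover and the methods IsSameMarker and RemoveFooter are hypothetical, the markers are assumed to be hr elements with size 3 and a lowercase color value of #ff00ff, and the two markers are compared by their size and color attribute values rather than by OuterHtml. It first looks for the end marker, so a file with only one matching hr is left untouched instead of losing everything after it.

```csharp
using System;
using HtmlAgilityPack;

static class FooterRemover
{
    // Hypothetical helper: two nodes count as the same marker when the tag name
    // and the size/color attribute values match, ignoring case.
    static bool IsSameMarker(HtmlNode a, HtmlNode b)
    {
        return b.NodeType == HtmlNodeType.Element
            && string.Equals(a.Name, b.Name, StringComparison.OrdinalIgnoreCase)
            && string.Equals(a.GetAttributeValue("size", ""),
                             b.GetAttributeValue("size", ""),
                             StringComparison.OrdinalIgnoreCase)
            && string.Equals(a.GetAttributeValue("color", ""),
                             b.GetAttributeValue("color", ""),
                             StringComparison.OrdinalIgnoreCase);
    }

    // Returns false and leaves the document unchanged when no start marker
    // or no matching end marker is found.
    public static bool RemoveFooter(HtmlDocument document)
    {
        // Assumption: the color value is written in lowercase in the source.
        var start = document.DocumentNode.SelectSingleNode("//hr[@size='3'][@color='#ff00ff']");
        if (start == null) return false;

        // Locate the end marker before deleting anything.
        HtmlNode end = start.NextSibling;
        while (end != null && !IsSameMarker(start, end))
        {
            end = end.NextSibling;
        }
        if (end == null) return false;

        // Remove every sibling after the start marker, up to and including the end marker.
        HtmlNode sibling;
        while ((sibling = start.NextSibling) != null)
        {
            sibling.Remove();
            if (ReferenceEquals(sibling, end)) break;
        }
        start.Remove();
        return true;
    }
}
```

Inside the per-file loop this could be used as: load the file, call FooterRemover.RemoveFooter(document), save only when it returns true, and log the file name otherwise.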
I think this is what you are looking for:
Code
string content = System.IO.File.ReadAllText(@"D:\New Text Document.txt");
string html = Regex.Replace(content, "<hr.*?>", "", RegexOptions.Singleline);
Result
//start element
<p style="text-align : center; color : Red; font-weight : bold;">How to cite this paper:</i></p>
<p style="text-align : left; color : black;">Ekmekçioglu, F. Çuna, Lynch, Michael F. & Willett, Peter (1996) "Stemming and N-gram matching for term conflation in Turkish texts" <em>Information Research</em>, <strong>1</strong>(1) Available at: http://informationr.net/ir/2-2/paper13.html</p>
<p style="text-align : center">© the authors, 1996.</p>
<div align="center">Check for citations, <a href="http://scholar.google.co.uk/scholar?hl=en&q=http://informationr.net/ir/2-2/paper13.html&btnG=Search&as_sdt=2000">using Google Scholar</a></div>
<table border="0" cellpadding="15" cellspacing="0" align="center">
<tr>
<td><a href="infres22.html"><h4>Contents</h4></a></td>
<td align="center" valign="top"><h5 align="center"><IMG SRC="http://counter.digits.net/wc/-d/-z/6/-b/FF0033/paper13" ALIGN=middle WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 VSPACE=2><br><a href="http://www.digits.net/ ">Web Counter</a><br>Counting only since 13 December 2002</h5></td>
<td><a href="http://InformationR.net/ir/"><h4>Home</h4></a></td>
</tr>
</table>
//end element
Source: https://stackoverflow.com/questions/48891744/delete-all-elements-between-two-elements