Losing the 'less than' sign in HtmlAgilityPack loadhtml

Submitted by 懵懂的女人 on 2019-12-10 12:55:00

Question


I recently started experimenting with the HtmlAgilityPack. I am not familiar with all of its options, and I think I am therefore doing something wrong.

I have a string with the following content:

string s = "<span style=\"color: #0000FF;\"><</span>";

You see that in my span I have a 'less than' sign. I process this string with the following code:

HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(s);

But when I do a quick and dirty look in the span like this:

htmlDocument.DocumentNode.ChildNodes[0].InnerHtml

I see that the span is empty.

Which option do I need to set to keep the 'less than' sign? I already tried this:

htmlDocument.OptionAutoCloseOnEnd = false;
htmlDocument.OptionCheckSyntax = false;
htmlDocument.OptionFixNestedTags = false;

but with no success.

I know it is invalid HTML. I am using this to fix invalid HTML and to apply HTMLEncode to the 'less than' signs.

Please point me in the right direction. Thanks in advance.


Answer 1:


The Html Agility Pack detects this as an error and creates an HtmlParseError instance for it. You can read all errors through the ParseErrors property of the HtmlDocument class. So, if you run this code:

    string s = "<span style=\"color: #0000FF;\"><</span>";
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);
    doc.Save(Console.Out);

    Console.WriteLine();
    Console.WriteLine();

    foreach (HtmlParseError err in doc.ParseErrors)
    {
        Console.WriteLine("Error");
        Console.WriteLine(" code=" + err.Code);
        Console.WriteLine(" reason=" + err.Reason);
        Console.WriteLine(" text=" + err.SourceText);
        Console.WriteLine(" line=" + err.Line);
        Console.WriteLine(" pos=" + err.StreamPosition);
        Console.WriteLine(" col=" + err.LinePosition);
    }

It will display this (the corrected text first, followed by details about the error):

<span style="color: #0000FF;"></span>

Error
 code=EndTagNotRequired
 reason=End tag </> is not required
 text=<
 line=1
 pos=30
 col=31

You can try to fix this error, since you have all the required information (including line, column, and stream position), but the general process of fixing (as opposed to detecting) errors in HTML is very complex.




Answer 2:


As mentioned in another answer, the best solution I found was to pre-parse the HTML to convert orphaned < symbols to their HTML encoded value &lt;.

return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
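
For context, here is a minimal, self-contained sketch of how that one-liner could be wrapped and tried on the string from the question. The names OrphanedLessThan and Encode are hypothetical and not part of HtmlAgilityPack; only Regex.Replace and the pattern come from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is an assumption made for this sketch.
static class OrphanedLessThan
{
    // Matches a '<' that is NOT followed by "one or more non-'<' characters, then '>'",
    // i.e. a '<' that does not start anything tag-shaped.
    private static readonly Regex Orphan = new Regex("<(?![^<]+>)");

    public static string Encode(string html)
    {
        return Orphan.Replace(html, "&lt;");
    }
}

static class Program
{
    static void Main()
    {
        string s = "<span style=\"color: #0000FF;\"><</span>";
        Console.WriteLine(OrphanedLessThan.Encode(s));
        // prints: <span style="color: #0000FF;">&lt;</span>
    }
}
```

The encoded string can then be passed to LoadHtml as usual. One limitation worth keeping in mind: text such as "a < b > c" is left untouched, because "< b >" looks like a tag to this pattern.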



Answer 3:


Fix the markup, because your HTML string is invalid:

string s = "<span style=\"color: #0000FF;\">&lt;</span>";
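
If the text inside the span is produced by code you control, one way to avoid the invalid markup in the first place is to encode the text before building the string. This is only a sketch, assuming System.Net.WebUtility.HtmlEncode is available on your .NET version; SpanBuilder and BlueSpan are hypothetical names chosen for illustration.

```csharp
using System;
using System.Net;

// Hypothetical helper class; the name is an assumption made for this sketch.
static class SpanBuilder
{
    // Wraps arbitrary text in the blue span from the question, encoding it first
    // so that characters like '<' cannot be mistaken for markup.
    public static string BlueSpan(string text)
    {
        return "<span style=\"color: #0000FF;\">" + WebUtility.HtmlEncode(text) + "</span>";
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(SpanBuilder.BlueSpan("<"));
        // prints: <span style="color: #0000FF;">&lt;</span>
    }
}
```

This only helps when you generate the HTML yourself; for markup that arrives already broken, a pre-processing step is still needed.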



Answer 4:


Although it is true that the given html is invalid, HtmlAgilityPack should still be able to parse it. It is not an uncommon mistake on the web to forget to encode "<", and if HtmlAgilityPack is used as a crawler, then it should anticipate bad html. I tested the example in IE, Chrome and Firefox, and they all show the extra < as text.

I wrote the following method that you can use to preprocess the html string and replace all 'unclosed' '<' characters with "&lt;":

// Requires: using System.Collections.Generic; using System.Text;
static string PreProcess(string htmlInput)
{
    // Index of the last unclosed '<' character, or -1 if the last '<' has been closed by a '>'.
    int lastLt = -1;

    // Positions of all unclosed '<' characters.
    List<int> ltPositions = new List<int>();

    // Collect the unclosed '<' characters.
    for (int i = 0; i < htmlInput.Length; i++)
    {
        if (htmlInput[i] == '<')
        {
            // A second '<' before any '>' means the previous '<' was never closed.
            if (lastLt != -1)
                ltPositions.Add(lastLt);

            lastLt = i;
        }
        else if (htmlInput[i] == '>')
            lastLt = -1;
    }

    // A '<' still pending at the end of the input is unclosed as well.
    if (lastLt != -1)
        ltPositions.Add(lastLt);

    // If no unclosed '<' characters were found, just return the input string.
    if (ltPositions.Count == 0)
        return htmlInput;

    // Build the output string, replacing each unclosed '<' with "&lt;" (3 extra characters each).
    StringBuilder htmlOutput = new StringBuilder(htmlInput.Length + 3 * ltPositions.Count);
    int start = 0;

    foreach (int ltPosition in ltPositions)
    {
        htmlOutput.Append(htmlInput, start, ltPosition - start);
        htmlOutput.Append("&lt;");
        start = ltPosition + 1;
    }

    htmlOutput.Append(htmlInput, start, htmlInput.Length - start);
    return htmlOutput.ToString();
}



Answer 5:


The string "s" is invalid HTML.

string s = "<span style=\"color: #0000FF;\">&lt;</span>";

This version is correct.



Source: https://stackoverflow.com/questions/5421527/losing-the-less-than-sign-in-htmlagilitypack-loadhtml
