Using C# regular expressions to remove HTML tags

悲&欢浪女 2020-11-22 05:59

How do I use a C# regular expression to replace or remove all HTML tags, including the angle brackets? Can someone please help me with the code?

10 Answers
  •  日久生厌
    2020-11-22 07:02

    The correct answer is: don't do that. Use the HTML Agility Pack.

    Edited to add:

    To shamelessly steal from the comment below by jesse, and to avoid being accused of inadequately answering the question after all this time, here's a simple, reliable snippet using the HTML Agility Pack that works even with most imperfectly formed, capricious bits of HTML:

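    // Requires the HtmlAgilityPack package; HttpUtility lives in System.Web.
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(Properties.Resources.HtmlContents);
    // SelectNodes returns null, not an empty collection, when nothing matches.
    var textNodes = doc.DocumentNode.SelectNodes("//body//text()");
    StringBuilder output = new StringBuilder();
    if (textNodes != null)
    {
        foreach (HtmlNode node in textNodes)
        {
            output.AppendLine(node.InnerText);
        }
    }
    // Decode entities such as &amp; into their plain-text characters.
    string textOnly = HttpUtility.HtmlDecode(output.ToString());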
    

    There are very few defensible cases for using a regular expression to parse HTML, because HTML can't be parsed correctly without context-awareness, which is very painful to provide even in a nontraditional regex engine. You can get part of the way there with a regex, but you'll need to verify the results manually.

    The HTML Agility Pack can give you a robust solution that reduces the need to manually fix up the aberrations that result from naively treating HTML as a regular language.

    A regular expression may get you mostly what you want most of the time, but it will fail on very common cases. If you can find a better/faster parser than HTML Agility Pack, go for it, but please don't subject the world to more broken HTML hackery.
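
    To make the failure modes concrete, here is a minimal sketch of the kind of naive regex approach the question asks about. The class name NaiveTagStripper and the pattern are illustrative assumptions, not code from the question or from any library; the sketch uses only System.Text.RegularExpressions.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    // Hypothetical helper for illustration only: strips anything that looks like a tag.
    public static class NaiveTagStripper
    {
        // Matches '<', then any run of characters other than '>', then '>'.
        private static readonly Regex TagPattern = new Regex("<[^>]*>");

        // Removes every match of the pattern; it knows nothing about attributes,
        // script blocks, comments, or HTML entities.
        public static string Strip(string html)
        {
            return TagPattern.Replace(html, string.Empty);
        }
    }
    ```

    On simple markup this behaves: Strip("<p>Hello <b>world</b></p>") gives "Hello world". It breaks as soon as a '>' appears inside an attribute value, where Strip("<a title=\"a > b\">link</a>") leaves the fragment " b\">link" behind, or a '<' appears inside a script, where Strip("<script>if (a < b) alert(1);</script>text") returns "if (a text". Entities such as &amp; are also left undecoded. These are the cases a real parser handles for you.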
