C# How to delete XML/HTML comments with regular expression

说谎 2020-12-09 04:17

The fragment below doesn't work for me.

fragment = Regex.Replace(fragment, "", String.Empty, RegexOptions.Multiline);
4 answers
  •  南笙 (original poster)
    2020-12-09 05:21

    This is the top Google result for stripping comments via C#, and here's my HtmlAgilityPack code for doing this.

            // Requires: using System.Linq; using HtmlAgilityPack;
            HtmlDocument doc = new HtmlDocument
                               {
                                   OptionFixNestedTags = true,
                                   OptionOutputAsXml = true
                               };
            doc.LoadHtml(str);
    
            // Strip comments from the document.
            if (doc.DocumentNode != null)
            {
                HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//comment()");
                if (nodes != null)
                {
                    foreach (HtmlNode node in from cmt in nodes
                                              where (cmt != null
                                                     && cmt.InnerText != null
                                                     && !cmt.InnerText.ToUpper().StartsWith("DOCTYPE"))
                                                     && cmt.ParentNode != null
                                              select cmt)
                    {
                        node.ParentNode.RemoveChild(node);
                    }
                }
            }
    

    This strips comments correctly and skips the doctype, which HtmlAgilityPack treats as a comment.

    Regex does work in controlled conditions, but if you're processing HTML from the wild web I'd recommend using HtmlAgilityPack. The HTML that is out there is very unpredictable, and regex will break.
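
For the controlled case, here is a minimal regex sketch. It assumes the intended pattern was something like `<!--.*?-->`, which is a guess, since the pattern in the question did not survive. The class and method names (`CommentStripper`, `StripComments`) are hypothetical and not part of the answer above. One likely reason a call like the one in the question fails on multi-line comments is that `RegexOptions.Multiline` only changes how `^` and `$` behave; it is `RegexOptions.Singleline` that lets `.` match newline characters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not taken from the original answer.
public static class CommentStripper
{
    // Assumed pattern: a lazy match from "<!--" to the nearest "-->".
    // Singleline makes '.' match '\n', so comments spanning several lines are matched.
    private static readonly Regex CommentPattern =
        new Regex("<!--.*?-->", RegexOptions.Singleline);

    // Returns the markup with every <!-- ... --> span removed.
    public static string StripComments(string markup)
    {
        if (markup == null)
        {
            throw new ArgumentNullException(nameof(markup));
        }

        return CommentPattern.Replace(markup, string.Empty);
    }
}
```

Calling `CommentStripper.StripComments(fragment)` leaves a `<!DOCTYPE ...>` declaration alone, because it does not start with `<!--`. It will, however, also remove comment-like text inside `<script>` blocks or CDATA sections, which is one of the ways regex breaks on real-world HTML.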
