How to remove <script> tags from an HTML page using C#?

盖世英雄少女心 2020-12-15 22:29

    
        
        

        
5 answers
  • 2020-12-15 22:39

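    This can be done with a regex:

    // Requires: using System.Text.RegularExpressions;
    // Lazily match each <script ...> element through its nearest </script>, ignoring tag case.
    Regex rRemScript = new Regex(@"<script[^>]*>[\s\S]*?</script>", RegexOptions.IgnoreCase);
    string output = rRemScript.Replace(input, "");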
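
    A minimal sketch of the same idea as a reusable helper, assuming you also want to accept closing tags written with trailing whitespace (`</script >`) and to bound matching time on hostile input. The names `ScriptStripper` and `RemoveScripts` and the one-second timeout are illustrative assumptions, not part of the answer above.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class ScriptStripper
    {
        // Hypothetical helper: matches <script ...> through the nearest </script>,
        // ignoring case and allowing whitespace before the closing '>'.
        private static readonly Regex ScriptBlock = new Regex(
            @"<script\b[^>]*>[\s\S]*?</script\s*>",
            RegexOptions.IgnoreCase,
            TimeSpan.FromSeconds(1)); // assumed timeout; throws RegexMatchTimeoutException if exceeded

        public static string RemoveScripts(string input)
        {
            return ScriptBlock.Replace(input, string.Empty);
        }
    }
    ```

    For example, `ScriptStripper.RemoveScripts("<p>a</p><SCRIPT>alert(1)</SCRIPT ><p>b</p>")` returns `<p>a</p><p>b</p>`. Like any regex approach, it will still be fooled by a script body that contains the text `</script>` inside a string literal.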
    
  • 2020-12-15 22:39

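    Using a regex. Note that this strips only the listed tags themselves; any text between an opening and closing tag, such as a script body, is left in place:

    string result = Regex.Replace(
        input,
        // [^>]* consumes everything up to the tag's closing '>', including newlines.
        @"</?(?:script|embed|object|frameset|frame|iframe|meta|link|style)[^>]*>",
        string.Empty,
        RegexOptions.IgnoreCase
    );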
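
    If the script and style bodies should go as well, one possible two-pass sketch is shown here: first drop `<script>` and `<style>` elements together with their content, then strip the remaining tags from the list above. The class name `TagStripDemo`, the method `Strip`, and the added `\b` word boundary are assumptions made for illustration.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class TagStripDemo
    {
        // Pass 1: remove <script>/<style> elements including everything inside them.
        private static readonly Regex Blocks = new Regex(
            @"<(script|style)\b[^>]*>[\s\S]*?</\1\s*>",
            RegexOptions.IgnoreCase);

        // Pass 2: remove the remaining listed tags; the text between them is kept.
        private static readonly Regex Tags = new Regex(
            @"</?(?:script|embed|object|frameset|frame|iframe|meta|link|style)\b[^>]*>",
            RegexOptions.IgnoreCase);

        public static string Strip(string input)
        {
            string withoutBlocks = Blocks.Replace(input, string.Empty);
            return Tags.Replace(withoutBlocks, string.Empty);
        }
    }
    ```

    With the input `<p>hi</p><script>alert(1)</script><iframe src="x">y</iframe>`, `TagStripDemo.Strip` returns `<p>hi</p>y`: the script body is gone, while the iframe's fallback text survives because only its tags are stripped.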
    
  • 2020-12-15 22:49

    This may seem like a strange solution.

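    If you don't want to use a third-party library and don't need to actually remove the script code, only disable it, you could do this:

    // Turn each opening tag into a comment start and each closing tag into a comment end.
    html = Regex.Replace(html, @"<script[^>]*>", "<!--", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"</script>", "-->", RegexOptions.IgnoreCase);
    

    This turns each script element into an HTML comment. It breaks if a script body itself contains "-->", because the comment then ends early.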

  • 2020-12-15 22:51

    As others have said, I think HTML Agility Pack is the best route; I've used it to scrape pages and handle plenty of awkward corner cases. However, if a simple regex is your goal, you could try the greedy pattern <script.*</script> with RegexOptions.Singleline. It matches from the first opening script tag through the last closing one, so it removes nasty nested JavaScript as well as the normal stuff, i.e. the kind referred to in the link (Regular Expression for Extracting Script Tags). Be aware that it also removes any markup sitting between two separate script elements:

    <html>
    <head>
        <script type="text/javascript" src="jquery.js"></script>
        <script type="text/javascript">
            if (window.self === window.top) { $.getScript("Wing.js"); }
        </script>
        <script> // nested horror
        var s = "<script></script>";
        </script>
    </head>
    </html>
    

    Usage:

    // Greedy .* with Singleline spans line breaks and runs to the last </script>.
    Regex regxScriptRemoval = new Regex(@"<script.*</script>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    var newHtml = regxScriptRemoval.Replace(oldHtml, "");
    
    return newHtml; // etc etc
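
    To make the trade-off between greedy and lazy matching concrete, here is a small sketch that runs both variants over the same input. The class and method names (`GreedyVsLazy`, `RemoveGreedy`, `RemoveLazy`) are made up for the illustration.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class GreedyVsLazy
    {
        private const RegexOptions Options =
            RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // Greedy: from the first <script through the LAST </script> in the input.
        public static string RemoveGreedy(string html)
        {
            return Regex.Replace(html, @"<script.*</script>", string.Empty, Options);
        }

        // Lazy: from each <script through the NEAREST </script>.
        public static string RemoveLazy(string html)
        {
            return Regex.Replace(html, @"<script.*?</script>", string.Empty, Options);
        }
    }
    ```

    On the nested case `<head><script>var s = "<script></script>";</script></head>`, the greedy version returns `<head></head>`, while the lazy version stops at the inner closing tag and leaves `<head>";</script></head>` behind. On `<script>a()</script><p>keep</p><script>b()</script>`, the roles flip: the lazy version returns `<p>keep</p>`, and the greedy one returns an empty string because it swallows the paragraph too. Neither variant is correct for both inputs, which is the practical argument for a real HTML parser.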
    
  • 2020-12-15 22:54

    May be worth a look: HTML Agility Pack

    Edit: specific working code

    // Requires: using System; using System.Linq; using HtmlAgilityPack;
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    string sampleHtml =
        "<html>" +
            "<head>" +
                "<script type=\"text/javascript\" src=\"jquery.js\"></script>" +
                "<script type=\"text/javascript\">" +
                    "if (window.self === window.top) { $.getScript(\"Wing.js\"); }" +
                "</script>" +
            "</head>" +
        "</html>";

    doc.LoadHtml(sampleHtml);

    // Copy the matches to a list first, so removing nodes does not disturb the enumeration.
    foreach (HtmlNode script in doc.DocumentNode.Descendants("script").ToList())
        script.Remove();
    Console.WriteLine(doc.DocumentNode.OuterHtml);
    