How to remove <script> tags from an HTML page using C#?

5 answers

        
                         				            
            
           
            
                              
                
              
              
                
灰色年华, 2020-12-15 22:39

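It can be done using a regex:

// Matches each <script ...>...</script> block, multi-line bodies included; IgnoreCase also catches <SCRIPT>.
Regex rRemScript = new Regex(@"<script[^>]*>[\s\S]*?</script\s*>", RegexOptions.IgnoreCase);
output = rRemScript.Replace(input, "");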

                                                                        
                                                        
            
            
              
                
隐瞒了意图╮, 2020-12-15 22:39

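Using a regex. Note that this strips only the tags themselves, so any text between <script> and </script> stays in the output:

string result = Regex.Replace(
    input,
    // \b stops "link" from matching <linker>; [^>]* replaces the backtracking-prone (.|\n|\s)*?
    @"</?(?:script|embed|object|frameset|frame|iframe|meta|link|style)\b[^>]*>",
    string.Empty,
    RegexOptions.IgnoreCase
);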
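
The pattern above deletes the tags but keeps whatever sat between them, so inline JavaScript ends up as plain text in the page. Here is a minimal sketch of the difference. The helper names StripTags and StripElements are hypothetical and not part of the answer:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagVsElementDemo
{
    // Hypothetical helper: removes only the <script ...> and </script> tags.
    public static string StripTags(string html) =>
        Regex.Replace(html, @"</?script\b[^>]*>", string.Empty, RegexOptions.IgnoreCase);

    // Hypothetical helper: removes each script element together with its body.
    public static string StripElements(string html) =>
        Regex.Replace(html, @"<script\b[^>]*>.*?</script\s*>", string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static void Main()
    {
        string html = "<p>Hi</p><script>alert(1);</script>";
        Console.WriteLine(StripTags(html));     // <p>Hi</p>alert(1);
        Console.WriteLine(StripElements(html)); // <p>Hi</p>
    }
}
```

If the goal is to get rid of the code and not just the markup, the second form is probably the one you want.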

                                                                        
                                                        
            
            
              
                
谎友^, 2020-12-15 22:49

This may seem like a strange solution.

If you don't want to use any third-party library and don't need to actually remove the script code, just disable it, you could do this:

html = Regex.Replace(html, @"<script[^>]*>", "<!--", RegexOptions.IgnoreCase);
html = Regex.Replace(html, @"</script>", "-->", RegexOptions.IgnoreCase);


This turns each script block into an HTML comment. It breaks if a script body itself contains -->, because that closes the comment early.
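
Commenting out the opening and closing tags separately goes wrong when the script body already contains -- or -->. One possible guard, sketched here with a MatchEvaluator that handles each block in one pass, is to break up those sequences. The helper name CommentOutScripts and the choice to rewrite "--" as "- -" are assumptions made for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptCommenter
{
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>(.*?)</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: replaces each script element with an HTML comment
    // holding its (neutralised) body.
    public static string CommentOutScripts(string html) =>
        ScriptBlock.Replace(html, m =>
        {
            // Put a space after every '-' that is followed by another '-',
            // so the body can no longer contain "--" or "-->".
            string body = Regex.Replace(m.Groups[1].Value, "-(?=-)", "- ");
            // The padding spaces keep a body starting with '>' from closing the comment.
            return "<!-- " + body + " -->";
        });

    public static void Main()
    {
        // Prints: <!-- i- -; // - -> done -->
        Console.WriteLine(CommentOutScripts("<script>i--; // --> done</script>"));
    }
}
```

The script text is altered, which is acceptable here only because the script is being disabled anyway.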
                                                                        
                                                        
            
            
              
                
北恋, 2020-12-15 22:51

I think, as others have said, HTML Agility Pack is the best route. I've used it to scrape pages and deal with loads of hard corner cases. However, if a simple regex is your goal, then maybe you could try <script.*?</script> with RegexOptions.Singleline, so that . also matches line breaks. This removes single-line and multi-line script blocks, i.e. the type referred to in the link (Regular Expression for Extracting Script Tags). The nested case at the end of this sample is the exception: a lazy match stops at the first </script>, inside the string literal, and leaves the rest of that block behind:

<html>
<head>
    <script type="text/javascript" src="jquery.js"></script>
    <script type="text/javascript">
        if (window.self === window.top) { $.getScript("Wing.js"); }
    </script>
    <script> // nested horror
    var s = "<script></script>";
    </script>
</head>
</html>


Usage:

// Singleline lets . cross line breaks; IgnoreCase also catches <SCRIPT>.
Regex regxScriptRemoval = new Regex(@"<script.*?</script>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
var newHtml = regxScriptRemoval.Replace(oldHtml, "");

return newHtml; // etc etc
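
A small sketch of how a lazy, Singleline match such as <script.*?</script> behaves on the nested case from the sample. The class name and the trimmed-down input are assumptions for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyMatchDemo
{
    // Removes everything from each "<script" up to the nearest "</script>".
    public static string RemoveScripts(string html) =>
        Regex.Replace(html, @"<script.*?</script>", "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static void Main()
    {
        // A script whose string literal contains its own <script></script> pair.
        string nested = "<head><script>\nvar s = \"<script></script>\";\n</script></head>";

        // The match ends at the first </script>, inside the string literal,
        // so the closing quote, the semicolon and the real </script> survive.
        Console.WriteLine(RemoveScripts(nested));
    }
}
```

That leftover tail is one reason a parser-based approach tends to be safer for input you do not control.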

                                                                        
                                                        
            
            
              
                
礼貌的吻别, 2020-12-15 22:54

May be worth a look: HTML Agility Pack

Edit: specific working code

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
string sampleHtml = 
    "<html>" +
        "<head>" + 
                "<script type=\"text/javascript\" src=\"jquery.js\"></script>" +
                "<script type=\"text/javascript\">" + 
                    "if (window.self === window.top) { $.getScript(\"Wing.js\"); }" +
                "</script>" +
        "</head>" +
    "</html>";

// LoadHtml parses the string directly, so no MemoryStream is needed.
doc.LoadHtml(sampleHtml);

// Copy the matches into a list first, then remove only the script nodes
// (the earlier loop removed every child of <head>, scripts or not).
List<HtmlNode> scripts = new List<HtmlNode>(doc.DocumentNode.Descendants("script"));
foreach (HtmlNode script in scripts)
    script.Remove();
Console.WriteLine(doc.DocumentNode.OuterHtml);

                                                                        
                                                        
            
            
              
                