jsoup adding extra encoded content to my HTML


jsoup is adding some extra encoded values to my HTML when I parse it. I am trying to take care of it by

String url = \"http://iqtestsites.adtech


                      


      
      
        
1 answer

萌比男神i
2020-12-22 06:34

Actually, jsoup is not adding the encoded stuff. It only adds the closing tags that appear to be missing. Let me explain.
First of all, jsoup normalizes your HTML when it parses it. In your case that means it adds the closing tags that are missing.
Example
Document doc = Jsoup.parse("<div>test<span>test");
System.out.println(doc.html());

Output:
<html>
 <head></head>
 <body>
  <div>
   test
   <span>test</span>
  </div>
 </body>
</html>

If you decode the escaped fragments, you will see that they are closing tags:
&lt;/div&gt;  = </div> 
&lt;/div&gt;  = </div>
&lt;/body&gt; = </body>
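
The mapping above can be checked mechanically. The following is a minimal sketch using only the Java standard library; the class name EntityDecoder and the method decodeBasicEntities are hypothetical names chosen for illustration, and the helper handles only the three entities relevant here, not the full HTML entity table.

```java
// Hypothetical helper, not part of jsoup: decodes only &lt;, &gt; and &amp;.
final class EntityDecoder {

    static String decodeBasicEntities(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   // Decode &amp; last so that "&amp;lt;" becomes "&lt;" and not "<".
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(decodeBasicEntities("&lt;/div&gt;"));  // prints </div>
        System.out.println(decodeBasicEntities("&lt;/body&gt;")); // prints </body>
    }
}
```

For anything beyond a quick check, jsoup's own Parser.unescapeEntities(String, boolean) is probably the better choice, since it knows the full set of named entities.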

If you go to the site and press Ctrl+U (in Chrome), you will see the source that jsoup will parse. Chrome colors the HTML tags that it recognizes as valid. For some reason it does not recognize the tags at the bottom (the same ones that show up with the escaped characters). jsoup has the same problem with those closing tags: it treats them as text rather than as closing tags, so it escapes them, and then it normalizes the HTML by adding the missing closing tags, as I explained earlier.
EDIT
I managed to replicate the behavior.
Document doc = Jsoup.parse("<iframe /><span>test</span>");
System.out.println(doc.html());

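You can see the exact same behavior. The problem is the self-closing iframe. HTML does not treat iframe as a void element, so the trailing slash is ignored and everything after the tag is read as the iframe's text content until a closing </iframe> is found. Writing it with an explicit closing tag fixes the problem: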
Document doc = Jsoup.parse("<iframe></iframe><span>test</span>");
System.out.println(doc.html());

EDIT 2
If you just want to receive the HTML without building the Document object, you can do this:
Connection.Response response = Jsoup.connect("http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.html").execute();
System.out.println(response.body()); // the HTML as an unparsed string

With the HTML string from the request above, you can find the self-closing iframe and replace it with the valid form (or remove it completely), then parse the resulting string with Jsoup.parse().
This fixes the problem of the closing tags after the iframe not being recognized, because the iframe is now valid.
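
The replacement step could look like the following minimal sketch. It uses only the Java standard library, so the jsoup calls are left out. The class name IframeFixer and the method fixSelfClosingIframes are hypothetical, and the regular expression assumes that no attribute value of the iframe contains a ">" character. Treat it as an illustration for this one page, not as a general HTML rewriter.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: rewrites <iframe ... /> as <iframe ...></iframe>.
final class IframeFixer {

    // Group 1 captures the attributes (possibly empty) of a self-closing iframe tag.
    // Assumption: attribute values never contain '>'.
    private static final Pattern SELF_CLOSING_IFRAME =
            Pattern.compile("<iframe\\b([^>]*?)\\s*/>", Pattern.CASE_INSENSITIVE);

    static String fixSelfClosingIframes(String html) {
        // Keep the attributes and add an explicit closing tag.
        return SELF_CLOSING_IFRAME.matcher(html).replaceAll("<iframe$1></iframe>");
    }

    public static void main(String[] args) {
        // prints <iframe></iframe><span>test</span>
        System.out.println(fixSelfClosingIframes("<iframe /><span>test</span>"));
    }
}
```

The string returned by body() can be passed through fixSelfClosingIframes first, and the result handed to Jsoup.parse(). An iframe that already has an explicit closing tag is left unchanged.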
                                                                        
                                                        
            
            
              
                