Jsoup adding extra encoded stuff to an HTML document

Submitted by 点点圈 on 2019-12-18 09:49:22

Question


jsoup is adding some extra encoded values to my HTML when I parse it. I am trying to work around it with:

String url = "http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.html";
System.out.println("Fetching " + url + "...");

Document doc = Jsoup.connect(url).get();
//System.out.println(doc.html());

Document.OutputSettings settings = doc.outputSettings();

settings.prettyPrint(false);
settings.escapeMode(Entities.EscapeMode.base);
settings.charset("ASCII");
String html = doc.html();
System.out.println(html);

But the Entities class cannot be found, and the code fails with an error. My imports are:

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

The original HTML is

<!DOCTYPE html>
<html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
<head>

</head>
<body>


<div style="background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height: 2059px; width: 1001px; text-align: center; margin: 0 auto;">                      

<div style="height:2058px; padding-left:0px; padding-top:36px;">


<iframe style="height:90px; width:728px;" />



</div>
</div>

</body>
</html>

The doc.html() from JSOUP gives this:

<!DOCTYPE html>
<html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
 <head> 
  <style>

</style> 
 </head> 
 <body> 
  <div style="background-image: url(aol.jpeg); background-repeat: no-repeat;-webkit-background-size:90720;height:720; width:90; text-align: center; margin: 0 auto;"> 
   <div style="height:450; width:100; padding-left:681px; padding-top:200px;"> 
    <iframe style="height:1050px; width:300px;"></iframe> &lt;/div&gt; &lt;/div&gt; &lt;/body&gt; &lt;/html&gt;
   </div>
  </div>
 </body>
</html>

Some encoded text has been added after the iframe element.

Please help.

Thanks Swaraj


Answer 1:


Actually, jsoup is not adding the encoded stuff. It only adds the closing tags that seem to be missing. Let me explain.

First of all, jsoup normalizes your HTML. In your case that means it adds any closing tags that are missing. For example:

Document doc = Jsoup.parse("<div>test<span>test");
System.out.println(doc.html());

Output:

<html>
 <head></head>
 <body>
  <div>
   test
   <span>test</span>
  </div>
 </body>
</html>

If you look at the encoded text, you will see that it consists of closing tags:

&lt;/div&gt;  = </div>
&lt;/div&gt;  = </div>
&lt;/body&gt; = </body>
&lt;/html&gt; = </html>
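
To check this for yourself, here is a minimal sketch in plain Java (no jsoup needed) that turns the escaped tail of the output back into markup. The class and method names are made up for illustration, and it only handles the two angle-bracket entities, not the full set of HTML entities.

```java
final class EntityPeek {
    // Hypothetical helper: reverses only the &lt; and &gt; escapes.
    static String unescapeAngleBrackets(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">");
    }

    public static void main(String[] args) {
        // The escaped text that appears after the iframe in the question's output.
        String tail = "&lt;/div&gt; &lt;/div&gt; &lt;/body&gt; &lt;/html&gt;";
        System.out.println(unescapeAngleBrackets(tail));
    }
}
```

The printed line is just the sequence of closing tags from the end of the original page, which is why they look like "extra" content.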

If you open the page in Chrome and press Ctrl+U, you will see the source that jsoup parses. Chrome colors the HTML tags it recognizes, and it does not recognize the tags at the bottom (the same ones that appear escaped in your output). jsoup has the same problem with those closing tags: it treats them as text rather than as tags, so it escapes them, and then it normalizes the HTML by adding the missing closing tags, as explained above.

EDIT: I managed to reproduce the behavior.

Document doc = Jsoup.parse("<iframe /><span>test</span>");
System.out.println(doc.html());

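This shows exactly the same behavior. The problem is the self-closing iframe. As far as I can tell, iframe is not a void element, so an HTML parser ignores the trailing slash, treats the tag as an opening <iframe>, and reads everything after it as the iframe's text content. Writing it with an explicit closing tag fixes the problem: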

Document doc = Jsoup.parse("<iframe></iframe><span>test</span>");
System.out.println(doc.html());

EDIT 2: If you just want the raw HTML without building the Document object, you can do this:

// Requires: import org.jsoup.Connection;
Connection.Response response = Jsoup.connect("http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.html").execute();
System.out.println(response.body()); // the response body as an unparsed string

With that string in hand, you can find the self-closing iframe and replace it with the valid form (or remove it completely). Then you can parse the string with Jsoup.parse(). This fixes the issue of the closing tags after the iframe not being recognized, because the markup is now valid.
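
The replace step described above could look something like this. It is a rough sketch, not part of jsoup: the class and method names are hypothetical, and the regex assumes that no attribute value contains a '>' character. The resulting string is what you would then pass to Jsoup.parse().

```java
import java.util.regex.Pattern;

final class IframeFix {
    // Matches <iframe ... /> in any letter case and captures the attribute text.
    // Assumption: attribute values do not contain '>'.
    private static final Pattern SELF_CLOSING_IFRAME =
            Pattern.compile("<iframe\\b([^>]*?)\\s*/>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: rewrites each self-closing iframe with an explicit closing tag.
    static String expandSelfClosingIframes(String html) {
        return SELF_CLOSING_IFRAME.matcher(html).replaceAll("<iframe$1></iframe>");
    }

    public static void main(String[] args) {
        String raw = "<div><iframe style=\"height:90px; width:728px;\" /></div>";
        // Prints the markup with the iframe written as an open/close pair.
        System.out.println(expandSelfClosingIframes(raw));
    }
}
```

An iframe that already has its own closing tag is left untouched, so it should be safe to run this over the whole response body before parsing.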



Source: https://stackoverflow.com/questions/20908946/jsoup-adding-extra-encoded-stuff-for-an-html
