jsoup adding extra encoded content to my HTML

[愿得一人] 2020-12-22 06:02

jsoup is adding some extra encoded values to my HTML when I parse it. I am trying to take care of it by

String url = "http://iqtestsites.adtech


        
1 Answer
  • 2020-12-22 06:34

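    Actually, jsoup is not inventing the encoded content. The escaped strings are closing tags from your page that jsoup treated as text, and the extra tags are closing tags that jsoup adds because they seem to be missing. Let me explain.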

    First of all, jsoup normalizes your HTML. In your case, that means it adds any closing tags that are missing. For example:

    Document doc = Jsoup.parse("<div>test<span>test");
    System.out.println(doc.html());
    

    Output:

    <html>
     <head></head>
     <body>
      <div>
       test
       <span>test</span>
      </div>
     </body>
    </html>
    

    If you look at the encoded content, you will see that it consists of closing tags:

    &lt;/div&gt;  = </div> 
    &lt;/div&gt;  = </div>
    &lt;/body&gt; = </body>
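
To make the mapping above concrete, here is a minimal sketch that decodes the escaped strings back into tags using only the JDK. The class `EntityDemo` and the method `decodeBasicEntities` are hypothetical names chosen for this illustration, and the helper only handles the three entities relevant here.

```java
// Hypothetical demo class (not part of jsoup): decodes &lt;, &gt; and &amp; only.
final class EntityDemo {

    static String decodeBasicEntities(String text) {
        // Replace &amp; last so that "&amp;lt;" decodes to "&lt;" rather than "<".
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(decodeBasicEntities("&lt;/div&gt;"));   // prints </div>
        System.out.println(decodeBasicEntities("&lt;/body&gt;"));  // prints </body>
    }
}
```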
    

    If you open the site in Chrome and press Ctrl+U, you will see the source that jsoup parses. Chrome colors the tags it recognizes as valid HTML, but it does not recognize the closing tags at the bottom of the page (the same ones that show up with escaped characters). jsoup has trouble with those closing tags for the same reason: it treats them as text rather than as closing tags, so it escapes them, and then it normalizes the HTML by adding the missing closing tags, as I explained earlier.

    EDIT: I managed to replicate the behavior.

    Document doc = Jsoup.parse("<iframe /><span>test</span>");
    System.out.println(doc.html());
    

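    You can see the exact same behavior. The problem is the self-closing iframe: iframe is not a void element, so the parser ignores the trailing slash and treats everything after it as the iframe's text content until it finds a closing </iframe>. Writing it like this fixes the problem: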

    Document doc = Jsoup.parse("<iframe></iframe><span>test</span>");
    System.out.println(doc.html());
    

    EDIT 2: If you just want the raw HTML without building a Document object, you can do this:

    // execute() fetches the page without parsing it; body() returns the raw response text
    Connection.Response response = Jsoup.connect("http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.html").execute();
    System.out.println(response.body());
    

    With the raw HTML in hand, you can find the self-closing iframe and replace it with the valid form (or remove it completely), then parse the resulting string with Jsoup.parse(). Because the markup is now valid, the closing tags after the iframe will be recognized.
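
As a rough illustration of the replacement step described above, here is a minimal sketch that rewrites self-closing iframe tags in a raw HTML string using only the JDK. The class name `IframeFixer`, the method `expandSelfClosingIframes`, and the regular expression are assumptions made for this example, not part of jsoup. The pattern also assumes that attribute values contain no `>` character and that no unquoted attribute value ends in `/`.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of jsoup): rewrites <iframe ... /> as <iframe ...></iframe>.
final class IframeFixer {

    // Matches "<iframe", optional attributes (captured in group 1),
    // optional whitespace, then "/>". Assumes no attribute value contains '>'.
    private static final Pattern SELF_CLOSING_IFRAME =
            Pattern.compile("<iframe\\b([^>]*?)\\s*/>", Pattern.CASE_INSENSITIVE);

    static String expandSelfClosingIframes(String html) {
        // $1 re-inserts the captured attributes; the tag name is written in lowercase.
        return SELF_CLOSING_IFRAME.matcher(html).replaceAll("<iframe$1></iframe>");
    }

    public static void main(String[] args) {
        String raw = "<iframe src=\"a.html\" /><span>test</span>";
        // prints <iframe src="a.html"></iframe><span>test</span>
        System.out.println(expandSelfClosingIframes(raw));
    }
}
```

The cleaned string could then be passed to Jsoup.parse() in place of the raw response body. A regular expression is only a stopgap for this one known defect; it is not a general way to repair HTML.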
