How to turn off automatic generation of close tags </tagName> in Jsoup?

Posted by 筅森魡賤 on 2019-12-23 04:23:45

Question


I was trying to parse an HTML document when I ran into the following scenario. I have put the content into a string in the code below. It has a P tag inside an anchor tag. When it is parsed with Jsoup, an extra </a> tag and an extra <a> tag are added near #Item1, which changes the HTML structure.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Test {
    public static void main(String[] args) {
        // An <A> wrapping a <p> that is still open when </A> arrives
        String html = "<A HREF=\"#Item1\">\n"
                + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n"
                + "<FONT SIZE=2>Item&nbsp;1.</FONT>\n"
                + "</A>";
        // Parse with the default (HTML) parser
        Document doc = Jsoup.parse(html);
        System.out.println("UNPARSED = \n" + html);
        System.out.println("JSOUP PARSED = \n" + doc.toString());
    }
}

OUTPUT

        UNPARSED = 
        <A HREF="#Item1">
        <p style="font-family:times;margin-top:12pt;margin-left:0pt;">
        <FONT SIZE=2>Item&nbsp;1.</FONT>
        </A>
        JSOUP PARSED = 
        <html>
         <head></head>
         <body>
          <a href="#Item1"> </a>
          <p style="font-family:times;margin-top:12pt;margin-left:0pt;"><a> <font size="2">Item&nbsp;1.</font> </a></p>
         </body>
        </html>

Is there any way to avoid this automatic tag completion in Jsoup? Thank you.


Answer 1:


-- UPDATE !!

As seen in the question "How to prevent tags replacement?", there is a great solution to this problem: parse with Jsoup's XML parser instead of the default HTML parser.

Parsing with:

Document doc = Jsoup.parse(html, "", Parser.xmlParser());

gives:

<a href="#Item1"> <p style="font-family:times;margin-top:12pt;margin-left:0pt;"> <font size="2">Item&nbsp;1.</font> </p></a>

Thanks @user2784201!
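
For reference, here is a minimal, self-contained sketch of the XML-parser approach applied to the markup from the question. The class name XmlParserSketch and the helpers parseAsXml and paragraphStaysInsideAnchor are made up for illustration; Jsoup.parse(String, String, Parser) and Parser.xmlParser() are the calls used in the update above. Depending on the Jsoup version, the XML parser may keep the original upper-case tag and attribute names, so the sketch compares tag names case-insensitively.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class XmlParserSketch {

    // Hypothetical helper: parse markup with Jsoup's XML parser,
    // which does not apply the HTML parser's browser-style repairs.
    static Document parseAsXml(String markup) {
        return Jsoup.parse(markup, "", Parser.xmlParser());
    }

    // Hypothetical helper: true if the first element is an <a>
    // whose first child element is a <p>.
    static boolean paragraphStaysInsideAnchor(Document doc) {
        Element anchor = doc.child(0);
        if (!anchor.tagName().equalsIgnoreCase("a") || anchor.children().isEmpty()) {
            return false;
        }
        return anchor.child(0).tagName().equalsIgnoreCase("p");
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"#Item1\">\n"
                + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n"
                + "<FONT SIZE=2>Item&nbsp;1.</FONT>\n"
                + "</A>";

        Document doc = parseAsXml(html);
        System.out.println(doc.outerHtml());
        System.out.println("p nested inside a: " + paragraphStaysInsideAnchor(doc));
    }
}
```

Keep in mind that the XML parser also skips the other HTML-specific behaviour, so the result has no html, head or body wrapper elements.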

-- OLD RESPONSE:

I'm not sure whether what you are asking for is possible, but I think it goes against Jsoup's philosophy of parsing HTML as closely as possible to the way a browser does.

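Note that browsers will close that A tag too. I think this goes back to HTML4, where putting a P inside an A was forbidden; see https://stackoverflow.com/a/1828032/3324704. HTML5 does allow a P inside an A, but in your markup the </A> arrives while the <p> is still open, and HTML5-style parsers repair that misnesting by closing the A early and reopening a copy of it inside the P.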

By the way, I think you are using an old version of Jsoup. If you use 1.8.1, you will see that the inner A tag (a spurious tag inserted by Jsoup, and also by browsers) keeps the href. This may help you in your parsing. See the output of Jsoup 1.8.1 (note the inner <a href="#Item1">):

JSOUP PARSED = 
<!DOCTYPE html>
<html>
 <head></head>
 <body>
  <a href="#Item1"> </a>
  <p style="font-family:times;margin-top:12pt;margin-left:0pt;"><a href="#Item1"> <font size="2">Item&nbsp;1.</font> </a></p>
 </body>
</html>
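
If you stay with the HTML parser, one way to use that retained href is to ignore the empty outer anchor and keep only the anchors that carry text. This is a rough sketch, assuming a Jsoup version that copies the href onto the inner tag as shown above; the class name LinkTextSketch and the helper linksWithText are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkTextSketch {

    // Hypothetical helper: return the a[href] elements that contain
    // non-blank text, skipping anchors left empty by the parser's repair.
    static List<Element> linksWithText(String html) {
        Document doc = Jsoup.parse(html); // default HTML parser
        List<Element> result = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            if (link.hasText()) {
                result.add(link);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"#Item1\">\n"
                + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n"
                + "<FONT SIZE=2>Item&nbsp;1.</FONT>\n"
                + "</A>";

        for (Element link : linksWithText(html)) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```

This does not restore the original structure. It only recovers which text belongs to which href.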

Furthermore, I've tried other libraries. HtmlCleaner reports an error (a - UnpermittedChild) and gives very similar output:

<?xml version="1.0" encoding="UTF-8"?>
<html>
<head></head>
<body><a href="#Item1">
</a><p style="font-family:times;margin-top:12pt;margin-left:0pt;"><a href="#Item1">
<font size="2">Item 1.</font>
</a></p></body></html>

And JTidy, which says:

Warning: missing </a> before <p>

and gives:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net" />
<title></title>
</head>
<body>
<a href="#Item1"></a>
<p style="font-family:times;margin-top:12pt;margin-left:0pt;"><font
size="2">Item&nbsp;1.</font> </p>
</body>
</html>

Maybe you could use a regular XML parser...
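
One caveat with that idea: a strict XML parser only accepts well-formed input, and the fragment from the question is not well-formed (SIZE=2 is unquoted, the <p> is never closed, and &nbsp; is not a predefined XML entity). Here is a small sketch that checks this with the JDK's built-in javax.xml.parsers API. The class name StrictXmlCheck, the helper isWellFormedXml and the hand-repaired string are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictXmlCheck {

    // Hypothetical helper: true if the markup parses as well-formed XML.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "<A HREF=\"#Item1\">\n"
                + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n"
                + "<FONT SIZE=2>Item&nbsp;1.</FONT>\n"
                + "</A>";
        // Hand-repaired variant: quoted attribute, closed <p>, numeric entity.
        String repaired = "<A HREF=\"#Item1\">"
                + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">"
                + "<FONT SIZE=\"2\">Item&#160;1.</FONT>"
                + "</p></A>";

        System.out.println("original well-formed: " + isWellFormedXml(original)); // false
        System.out.println("repaired well-formed: " + isWellFormedXml(repaired)); // true
    }
}
```

So a plain XML parser is only an option if the input is cleaned up first or is already valid XHTML.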

Sorry for the verbosity and the unsatisfactory response :(



Source: https://stackoverflow.com/questions/27040626/how-to-turn-off-automatic-generation-of-close-tags-tagname-in-jsoup
