Question
I was trying to parse an HTML document and ran into the following scenario. I have put the content into a string in the code below. It contains a P tag inside an anchor tag. When parsed with Jsoup, an extra </a> tag and an extra <a> tag are added near #Item1, which changes the HTML structure.
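import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Test {
    public static void main(String[] args) {
        // A <p> opened inside an <a> and never closed before </A>
        String html = "<A HREF=\"#Item1\">\n"
                + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n"
                + "<FONT SIZE=2>Item 1.</FONT>\n"
                + "</A>";
        // Default HTML parser: repairs the markup the way a browser would
        Document doc = Jsoup.parse(html);
        System.out.println("UNPARSED = \n" + html);
        System.out.println("JSOUP PARSED = \n" + doc.toString());
    }
}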
OUTPUT
UNPARSED =
<A HREF="#Item1">
<p style="font-family:times;margin-top:12pt;margin-left:0pt;">
<FONT SIZE=2>Item 1.</FONT>
</A>
JSOUP PARSED =
<html>
<head></head>
<body>
<a href="#Item1"> </a>
<p style="font-family:times;margin-top:12pt;margin-left:0pt;"><a> <font size="2">Item 1.</font> </a></p>
</body>
</html>
Is there any way to avoid this automatic tag completion in Jsoup? Thank you.
Answer 1:
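-- UPDATE:
As shown in the question "How to prevent tags replacement?", there is a great solution to this problem: use Jsoup's XML parser, which builds a simple tree directly from the input instead of applying HTML rules.
Parsing with: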
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Will give:
<a href="#Item1"> <p style="font-family:times;margin-top:12pt;margin-left:0pt;"> <font size="2">Item 1.</font> </p></a>
Thanks @user2784201!
-- OLD RESPONSE:
I'm not sure whether what you are asking for is possible, but I think it goes against Jsoup's philosophy of parsing HTML as closely as possible to the way a browser does.
Note that browsers will also close that A tag. I think this is because putting a P inside an A was forbidden in HTML4. See https://stackoverflow.com/a/1828032/3324704.
By the way, I think you are using an old version of Jsoup. If you use 1.8.1, you will see that the inner A tag (a spurious tag inserted by Jsoup, and also by browsers) keeps the href. This may help you in your parsing. See the output of Jsoup 1.8.1 (note the inner <a href="#Item1">):
JSOUP PARSED =
<!DOCTYPE html>
<html>
<head></head>
<body>
<a href="#Item1"> </a>
<p style="font-family:times;margin-top:12pt;margin-left:0pt;"><a href="#Item1"> <font size="2">Item 1.</font> </a></p>
</body>
</html>
Furthermore, I've tried other libraries. HtmlCleaner reports an error (a - UnpermittedChild) and gives very similar output:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head></head>
<body><a href="#Item1">
</a><p style="font-family:times;margin-top:12pt;margin-left:0pt;"><a href="#Item1">
<font size="2">Item 1.</font>
</a></p></body></html>
And JTidy, which says:
Warning: missing </a> before <p>
and gives:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net" />
<title></title>
</head>
<body>
<a href="#Item1"></a>
<p style="font-family:times;margin-top:12pt;margin-left:0pt;"><font
size="2">Item 1.</font> </p>
</body>
</html>
Maybe you could use a regular XML parser...
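One caveat on the regular XML parser idea: a strict parser will probably reject the snippet from the question outright, because SIZE=2 is an unquoted attribute value and the <p> is never closed. The following is only a rough sketch, assuming the JDK's built-in javax.xml.parsers API. The class and method names (StrictXmlCheck, strictParseError) are made up for illustration and are not part of Jsoup or any other library.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class StrictXmlCheck {

    /**
     * Parses the markup with the JDK's strict XML parser.
     * Returns null if the markup is well-formed XML,
     * otherwise the parser's error message.
     */
    static String strictParseError(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXException e) {
            // Raised for markup that is not well-formed XML.
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The snippet from the question: unquoted SIZE=2 and an unclosed <p>.
        String html = "<A HREF=\"#Item1\">\n"
                + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n"
                + "<FONT SIZE=2>Item 1.</FONT>\n"
                + "</A>";
        String error = strictParseError(html);
        System.out.println(error == null
                ? "Well-formed XML"
                : "Rejected by strict XML parser: " + error);
    }
}
```

If the input cannot be cleaned up first, a lenient parser such as Jsoup's Parser.xmlParser() from the update above is likely the more practical choice, since it accepts this markup and keeps the P inside the A.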
Sorry for the verbosity and the unsatisfactory response :(
Source: https://stackoverflow.com/questions/27040626/how-to-turn-off-automatic-generation-of-close-tags-tagname-in-jsoup