Question
I need to process an HTML document and insert some nodes in a few places. The content I'm processing is not valid, but Nokogiri is smart enough to figure out what it should be. The problem is that I don't want to change the original document's formatting, other than the pieces I'm inserting.
Here is an example:
require 'nokogiri'
orig_html = '
<html>
<meta name="Generator" content="Microsoft Word 97 O.o">
<body>
1
<b><p>2</p></b>
3
</body>
</html>'
puts Nokogiri::HTML(orig_html).inner_html
# >> <html>
# >> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
# >> <meta name="Generator" content="Microsoft Word 97 O.o">
# >> </head>
# >> <body>
# >> 1
# >> <b></b><p>2</p>
# >> 3
# >> </body>
# >> </html>
I'd like the output to be the same as the input. The problem is that I can't have <p> inside of <b>. My inclination is to switch to XML, but then there are invalid tags such as the <meta> tag, which is not closed off. HTML is smart enough to recognize this, but XML is not.
Answer 1:
Nokogiri is fixing up the malformed HTML in order to make it parseable. After it has finished the DOM is in a reasonable state, but the original document isn't available from Nokogiri any more.
If you want the original to be untouched, you have to make it valid before passing it to Nokogiri; then you can manipulate it using Nokogiri's methods. Typically I'd do that with some regex that finds the trouble spots and adds or adjusts tags (or their associated closing tags), so that Nokogiri can parse the document without needing to fix anything.
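As a rough sketch of that kind of regex pre-pass, assuming the only trouble spot is a <b> wrapped directly around a single <p> that contains plain text (the helper name swap_bold_paragraph is made up for illustration and is not part of Nokogiri):

```ruby
# Hypothetical helper, not part of Nokogiri: rewrites <b><p>text</p></b>
# as <p><b>text</b></p> so the parser has no nesting to repair.
# Assumption: the paragraph holds plain text only (no nested tags).
def swap_bold_paragraph(html)
  html.gsub(%r{<b>\s*<p>([^<]*)</p>\s*</b>}i) do
    "<p><b>#{Regexp.last_match(1)}</b></p>"
  end
end

orig_html = '
<html>
<meta name="Generator" content="Microsoft Word 97 O.o">
<body>
1
<b><p>2</p></b>
3
</body>
</html>'

fixed_html = swap_bold_paragraph(orig_html)
puts fixed_html
```

Only the matched span changes; the rest of the string is left exactly as it was. Real-world markup will almost certainly need more patterns than this one.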
It's not a case of HTML being smarter than XML; it's a case of Nokogiri honoring the spirit of the XML specification, which is rigid, and raising flags by populating the errors array when the file is invalid. HTML has a less rigid specification, and, because browsers are (too) forgiving when parsing and displaying HTML, Nokogiri follows along somewhat, does fixups, and then populates the errors array. (In either case, you can check that array to see what's wrong.)
require 'nokogiri'
orig_html = '
<html>
<meta name="Generator" content="Microsoft Word 97 O.o">
<body>
1
<b><p>2</p></b>
3
</body>
</html>'
doc = Nokogiri::HTML(orig_html)
doc.errors
doc.errors contains:
[
[0] #<Nokogiri::XML::SyntaxError: Unexpected end tag : b>
]
Here's how I'd use Nokogiri to fix your sample HTML:
doc = Nokogiri::HTML(orig_html)
p = doc.at('b+p')
p.previous_sibling.remove
This is the HTML at this point:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator" content="Microsoft Word 97 O.o">
</head>
<body>
1
<p>2</p>
3
</body>
</html>
Continuing:
p.inner_html = "<b>#{p.content}</b>"
puts doc.to_html
This is the resulting HTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator" content="Microsoft Word 97 O.o">
</head>
<body>
1
<p><b>2</b></p>
3
</body>
</html>
It's pretty obvious the sample HTML isn't what you're really working with, so you'll have to change the accessors to locate the tags that need to be changed, but that should get you going.
Answer 2:
The answer above works for that specific situation, but not for a case like the one below, where the <b> contains text of its own before the <p>.
orig_html = '
<html>
<meta name="Generator" content="Microsoft Word 97 O.o">
<body>
1
<b>this is a bold
<p>This is a paragraph</p>
</b>
3
</body>
</html>'
doc = Nokogiri::HTML(orig_html)
p = doc.at('b+p')
p.previous_sibling.remove
p.inner_html = "<b>#{p.content}</b>"
puts doc.to_html
The resulting HTML (note that the text "this is a bold" has been lost):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator" content="Microsoft Word 97 O.o">
</head>
<body>
1
<p><b>This is a paragraph</b></p>
3
</body>
</html>
Source: https://stackoverflow.com/questions/14515190/nokogiri-generating-invalid-html