Nokogiri generating invalid HTML?

Posted by 孤街浪徒 on 2019-12-08 07:46:55

Question


I need to process an HTML document and insert some nodes in a few places. The content I'm processing is not valid, but Nokogiri is smart enough to figure out what it should be. The problem is that I don't want to change the original document's formatting, other than the pieces I'm inserting.

Here is an example:

require 'nokogiri'

orig_html = '
  <html>
  <meta name="Generator" content="Microsoft Word 97 O.o">
  <body>
    1
    <b><p>2</p></b>
    3
  </body>
</html>'

puts Nokogiri::HTML(orig_html).inner_html

# >> <html>
# >> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
# >> <meta name="Generator" content="Microsoft Word 97 O.o">
# >> </head>
# >> <body>
# >>         1
# >>         <b></b><p>2</p>
# >>         3
# >>       </body>
# >> </html>

I'd like the output to be the same as the input. The problem is that a <p> is not allowed inside a <b>, so Nokogiri moves it out. My inclination is to switch to the XML parser, but then there are invalid tags such as the <meta> tag, which is never closed. The HTML parser tolerates this, but the XML parser does not.


Answer 1:


Nokogiri is fixing up the malformed HTML in order to make it parseable. After it has finished, the DOM is in a reasonable state, but the original document is no longer available from Nokogiri.

If you want the original left untouched, you have to make it valid before passing it to Nokogiri; after that you can manipulate it using Nokogiri's methods. Typically I'd do that with a regex that finds the trouble spots and adds or adjusts tags or their closing tags, so Nokogiri can parse the document without needing to fix anything.
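As a rough illustration of that pre-processing step, here is a minimal sketch in plain Ruby. The helper name nest_bold_inside_paragraph is made up for this example, and the pattern only covers the single case from the question (a <p> wrapped directly in a <b>); real documents would need their own patterns.

```ruby
# Hypothetical helper (not part of Nokogiri): rewrites <b><p>...</p></b>
# as <p><b>...</b></p> so the parser has nothing to repair at that spot.
# It only handles this one pattern and leaves everything else untouched.
def nest_bold_inside_paragraph(html)
  html.gsub(%r{<b>(\s*)<p>(.*?)</p>(\s*)</b>}im) do
    "#{$1}<p><b>#{$2}</b></p>#{$3}"
  end
end

orig_html = '
  <html>
  <meta name="Generator" content="Microsoft Word 97 O.o">
  <body>
    1
    <b><p>2</p></b>
    3
  </body>
</html>'

fixed_html = nest_bold_inside_paragraph(orig_html)
puts fixed_html
```

The string in fixed_html can then be handed to Nokogiri::HTML. Nokogiri may still normalize other parts of the document, such as adding a <head>, so this only addresses the <b>/<p> nesting.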

It's not a case of HTML being smarter than XML. Nokogiri honors the spirit of the XML specification, which is rigid, and flags an invalid file by populating the errors array. HTML has a less rigid specification, and because browsers are (too) forgiving when parsing and displaying HTML, Nokogiri follows along somewhat: it does fixups and then populates the errors array. (In either case, you can check that array to see what's wrong.)

require 'nokogiri'

orig_html = '
  <html>
  <meta name="Generator" content="Microsoft Word 97 O.o">
  <body>
    1
    <b><p>2</p></b>
    3
  </body>
</html>'

doc = Nokogiri::HTML(orig_html)
doc.errors

doc.errors contains:

[
    [0] #<Nokogiri::XML::SyntaxError: Unexpected end tag : b>
]

Here's how I'd use Nokogiri to fix your sample HTML:

doc = Nokogiri::HTML(orig_html)
p = doc.at('b+p')
p.previous_sibling.remove

This is the HTML at this point:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator" content="Microsoft Word 97 O.o">
</head>
<body>
    1
    <p>2</p>
    3
  </body>
</html>

Continuing:

p.inner_html = "<b>#{p.content}</b>"
puts doc.to_html

This is the resulting HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator" content="Microsoft Word 97 O.o">
</head>
<body>
    1
    <p><b>2</b></p>
    3
  </body>
</html>

It's pretty obvious the sample HTML isn't what you're really working with, so you'll have to change the selectors to locate the tags that need to be changed, but that should get you going.


Answer 2:


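The approach above works for that specific sample, but not for a case like the one below, where the <b> element also contains text before the <p>. Removing the <b> sibling discards that text ("this is a bold"):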

    orig_html = '
      <html>
      <meta name="Generator" content="Microsoft Word 97 O.o">
      <body>
        1
        <b>this is a bold
          <p>This is a paragraph</p>
        </b>
        3
      </body>
    </html>'

    doc = Nokogiri::HTML(orig_html)
    p = doc.at('b+p')

    p.previous_sibling.remove
    p.inner_html = "<b>#{p.content}</b>"
    puts doc.to_html

The resulting HTML:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta name="Generator" content="Microsoft Word 97 O.o">
    </head>
    <body>
        1
         <p><b>This is a paragraph</b></p>

        3
      </body>
    </html>


Source: https://stackoverflow.com/questions/14515190/nokogiri-generating-invalid-html
