How can I get Nokogiri to parse and return an XML document?

[亡魂溺海] submitted on 2019-12-05 00:46:01

Question


Here's a sample of some oddness:

#!/usr/bin/ruby

require 'rubygems'
require 'open-uri'
require 'nokogiri'

print "without read: ", Nokogiri(open('http://weblog.rubyonrails.org/')).class, "\n"
print "with read:    ", Nokogiri(open('http://weblog.rubyonrails.org/').read).class, "\n"

Running this returns:

without read: Nokogiri::XML::Document
with read:    Nokogiri::HTML::Document

So without the read I get an XML document, and with it I get an HTML document? The web page is declared as "XHTML transitional", so at first I thought Nokogiri must have been reading OpenURI's "content-type" from the stream, but that returns 'text/html':

(rdb:1) doc = open(('http://weblog.rubyonrails.org/'))
(rdb:1) doc.content_type
"text/html"

which is what the server is returning. So, now I'm trying to figure out why Nokogiri is returning two different values. It doesn't appear to be parsing the text and using heuristics to determine whether the content is HTML or XML.

The same thing is happening with the Atom feed pointed to by that page:

(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
(rdb:1) doc.class
Nokogiri::XML::Document

(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails').read)
(rdb:1) doc.class
Nokogiri::HTML::Document

I need to be able to parse a page without knowing in advance whether it is HTML or a feed (RSS or Atom), and reliably determine which it is. I asked Nokogiri to parse the body of either an HTML page or an XML feed, but I'm seeing those inconsistent results.

I thought I could write some tests to determine the type but then I ran into xpaths not finding elements, but regular searches working:

(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
(rdb:1) doc.class
Nokogiri::XML::Document
(rdb:1) doc.xpath('/feed/entry').length
0
(rdb:1) doc.search('feed entry').length
15

I figured xpaths would work with XML but the results don't look trustworthy either.

These tests were all done on my Ubuntu box, but I've seen the same behavior on my MacBook Pro. I'd love to find out I'm doing something wrong, but I haven't seen an example for parsing and searching that gave me consistent results. Can anyone show me the error of my ways?


Answer 1:


It has to do with the way Nokogiri's parse method works. Here's the source:

# File lib/nokogiri.rb, line 55
    def parse string, url = nil, encoding = nil, options = nil
      doc =
        if string =~ /^\s*<[^Hh>]*html/i # Probably html
          Nokogiri::HTML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_HTML)
        else
          Nokogiri::XML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_XML)
        end
      yield doc if block_given?
      doc
    end

The key is the line if string =~ /^\s*<[^Hh>]*html/i # Probably html. When you just use open, it returns an IO-like object (a Tempfile or a StringIO) rather than a String, so on the Ruby versions this code targeted the =~ test evaluates to nil and the XML branch is always taken. On the other hand, read returns a String, so the regex is actually applied to the content and the result can be HTML. In this case it is, because the content matches that regex. Here's the start of that string:

<!DOCTYPE html PUBLIC

The regex matches the "!DOCTYPE " with [^Hh>]* and then matches the "html", so it assumes the content is HTML. Why someone selected this regex to determine if the file is HTML is beyond me. With this regex, a file that begins with a tag like <definitely-not-html> is considered HTML, but <this-is-still-not-html> is considered XML, because the "h" in "this" stops [^Hh>]* before it can reach "html". You're probably best off staying away from this dumb function and calling the class methods Nokogiri::HTML::Document.parse or Nokogiri::XML::Document.parse directly.
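
To see the sniffing behaviour in isolation, here is a minimal sketch that applies the same regex to a few inputs without involving Nokogiri at all. The helper name sniff_markup and the sample strings are my own assumptions for illustration, not anything from the library. Unlike the quoted source, the helper reads IO-like objects first, so a String and a stream with the same content get the same answer.

```ruby
require 'stringio'

# The same pattern as in the Nokogiri.parse source quoted above.
HTML_SNIFF = /^\s*<[^Hh>]*html/i

# Hypothetical helper (not part of Nokogiri): classifies a String or an
# IO-like object (anything responding to #read) as :html or :xml.
def sniff_markup(input)
  text = input.respond_to?(:read) ? input.read : input.to_s
  text.match?(HTML_SNIFF) ? :html : :xml
end

doctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">'

p sniff_markup(doctype)                    # => :html
p sniff_markup(StringIO.new(doctype))      # => :html (stream is read first)
p sniff_markup('<definitely-not-html>')    # => :html
p sniff_markup('<this-is-still-not-html>') # => :xml

# Made-up feed fragment: any line that starts with a tag such as
# <content type="html"> also trips the regex.
feed = "<?xml version=\"1.0\"?>\n<feed>\n  <content type=\"html\">x</content>\n</feed>"
p sniff_markup(feed)                       # => :html
```

The last case is one plausible reason a feed read into a String can come back as an HTML document, though I have not checked that this is what happened with the feed in the question.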




Answer 2:


Responding to this part of your question:

I thought I could write some tests to determine the type but then I ran into xpaths not finding elements, but regular searches working:

I've just come across this problem using Nokogiri to parse an Atom feed. The problem seemed to come down to the default (unprefixed) namespace declaration:

<feed xmlns="http://www.w3.org/2005/Atom">

Removing the xmlns declaration from the source XML lets Nokogiri search with XPath as usual, because the elements are then no longer in a namespace. Removing that declaration from the feed obviously wasn't an option here, so instead I just removed the namespaces from the document after parsing, e.g.:

doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
doc.remove_namespaces!
doc.xpath('/feed/entry').length

Ugly I know, but it did the trick.



Source: https://stackoverflow.com/questions/1157138/how-can-i-get-nokogiri-to-parse-and-return-an-xml-document
