Python XML parsing not working for some sites

Question

I have a very basic XML parser based on the tutorial provided here, for the purpose of reading RSS feeds in Python.

import urllib
from xml.dom import minidom, Node

def GetRSS(RSSurl):
    url_info = urllib.urlopen(RSSurl)
    if (url_info):
        xmldoc = minidom.parse(url_info)
    if (xmldoc):
        for item_node in xmldoc.documentElement.childNodes:
            if (item_node.nodeName == "item"):  
                PrintNodeItems(item_node, ["title","link"])
    else:
        print "error"

def PrintNodeItems(XmlNode, items):
    for item_node in XmlNode.childNodes:
        if item_node.nodeName in items:
            PrintNodesText(item_node)

def PrintNodesText(XmlNode):
    text = ""
    for text_node in XmlNode.childNodes:
        if(text_node.nodeType == Node.TEXT_NODE):
            text = text_node.nodeValue
    if (len(text)>0):
        print text
        print ""

I have tested the GetRSS function on the address provided in the tutorial (http://rss.slashdot.org/Slashdot/slashdot), and it works just fine, printing the expected titles and links. However, my intention when learning how to write this module was to use it for reading the RSS feed at RedLetterMedia (http://redlettermedia.com/feed/). When I call GetRSS on that address in the Python shell, I get a blank line instead of the expected results. I also tested it on CNN's "World" RSS feed and got no results there either. I have fetched all three addresses with urllib.urlopen, and they all appear to use the same format for their nodes and child nodes (<item><title><description><link></item>).

I figure, as was the case for my previous question, there is probably something very obvious I am missing. Does anybody know what that is?

Edit: For the record, my error message has not come up at all, but maybe that's because I integrated it into the code incorrectly; I would not put it past me.

Update: I rewrote the code from scratch using several answered questions on Stack Overflow. Works like a charm!

def GetRSS(RSSurl):
    url_info = urllib.urlopen(RSSurl)
    xmldoc = minidom.parse(url_info)
    # Search each level from its parent node rather than from xmldoc,
    # so every link is printed once instead of once per item.
    for channel in xmldoc.getElementsByTagName('channel'):
        for item in channel.getElementsByTagName('item'):
            for a in item.getElementsByTagName('link'):
                print a.firstChild.data


def main():
    GetRSS('http://redlettermedia.com/feed/')
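
A minimal sketch of per-item lookup, assuming Python 3 and an inline sample feed in place of the download. SAMPLE_FEED, element_text, and item_links are illustrative names, not part of the code above. Calling getElementsByTagName on each item node, rather than on the whole document, limits the search to that item, so each title is paired with its own link.

```python
from xml.dom import minidom

# Hypothetical stand-in for a downloaded RSS 2.0 feed (an assumption, not real data).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def element_text(parent, tag):
    """Return the text of the first <tag> found under parent, or '' if absent."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data.strip()


def item_links(xml_text):
    """Return one (title, link) pair per <item> element."""
    doc = minidom.parseString(xml_text)
    return [
        (element_text(item, "title"), element_text(item, "link"))
        for item in doc.getElementsByTagName("item")
    ]


for title, link in item_links(SAMPLE_FEED):
    print(title, link)
```

The channel's own title and link are skipped here because the lookup starts from each item node, not from the document.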

Answer 1:

The error is here:

for item_node in xmldoc.documentElement.childNodes:
    if (item_node.nodeName == "item"):

The root element has no item children, only a channel; the item elements are nested inside that channel, so the nodeName check never matches. I found this out by printing every nodeName value in the loop.
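
The nodeName check described above can be sketched roughly as follows, assuming Python 3 and a small inline RSS 2.0 sample instead of the live feed. SAMPLE_FEED and child_element_names are hypothetical names used only for illustration.

```python
from xml.dom import minidom, Node

# Hypothetical RSS 2.0 sample; a real feed would be fetched over HTTP instead.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>"""


def child_element_names(node):
    """Names of the direct element children of node, skipping whitespace text nodes."""
    return [
        child.nodeName
        for child in node.childNodes
        if child.nodeType == Node.ELEMENT_NODE
    ]


doc = minidom.parseString(SAMPLE_FEED)
root = doc.documentElement
channel = root.getElementsByTagName("channel")[0]

# The root's only element child is channel; the items appear one level down.
print(root.nodeName, child_element_names(root))
print(channel.nodeName, child_element_names(channel))
```

A feed whose root holds item elements directly (RSS 1.0/RDF feeds are laid out that way) would pass the original check, which may be why the tutorial's feed worked. That explanation is a guess, not something the answer states.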

Source: https://stackoverflow.com/questions/9190934/python-xml-parsing-not-working-for-some-sites

Tags

python

xml-parsing