Python XML parsing not working for some sites

Submitted by 蹲街弑〆低调 on 2019-12-11 10:01:57

Question


I have a very basic XML parser based on the tutorial provided here, for the purpose of reading RSS feeds in Python.

import urllib
from xml.dom import minidom, Node

def GetRSS(RSSurl):
    url_info = urllib.urlopen(RSSurl)
    if (url_info):
        xmldoc = minidom.parse(url_info)
    if (xmldoc):
        for item_node in xmldoc.documentElement.childNodes:
            if (item_node.nodeName == "item"):  
                PrintNodeItems(item_node, ["title","link"])
    else:
        print "error"

def PrintNodeItems(XmlNode, items):
    for item_node in XmlNode.childNodes:
        if item_node.nodeName in items:
            PrintNodesText(item_node)

def PrintNodesText(XmlNode):
    text = ""
    for text_node in XmlNode.childNodes:
        if(text_node.nodeType == Node.TEXT_NODE):
            text = text_node.nodeValue
    if (len(text)>0):
        print text
        print ""

I have tested the GetRSS function on the address provided in the tutorial (http://rss.slashdot.org/Slashdot/slashdot), and it works just fine, giving me the correct output. However, my goal in writing this module was to read the RSS feed at RedLetterMedia (http://redlettermedia.com/feed/). When I call GetRSS on that address in the Python shell, I get a blank line instead of the expected results. I also tested it on CNN's "World" RSS feed and got no results there either. I have opened all three addresses with urllib.urlopen, and they all appear to use the same format for their nodes and child nodes (<item><title><description><link></item>).

I figure, as was the case for my previous question, there is probably something very obvious I am missing. Does anybody know what that is?

Edit: For the record, my error message has not come up at all, but maybe that's because I integrated it into the code incorrectly; I would not put it past me.

Update: I rewrote the code from scratch using several answered questions on Stack Overflow. Works like a charm!

import urllib
from xml.dom import minidom

def GetRSS(RSSurl):
    url_info = urllib.urlopen(RSSurl)
    xmldoc = minidom.parse(url_info)
    # Search inside each <item> rather than the whole document,
    # so every item's link is printed exactly once.
    for item in xmldoc.getElementsByTagName('item'):
        for a in item.getElementsByTagName('link'):
            linktext = a.firstChild.data
            print linktext


def main():
    GetRSS('http://redlettermedia.com/feed/')
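
The getElementsByTagName approach from the update can also be scoped to each item, so that a title stays paired with its own link. Below is a minimal sketch in Python 3 syntax (the code above is Python 2). It parses a made-up sample feed instead of fetching the real one; SAMPLE_RSS, first_text and item_titles_and_links are hypothetical names introduced only for illustration.

```python
from xml.dom import minidom

# Hypothetical RSS 2.0 sample used for illustration; the real feed is not fetched.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def first_text(element, tag):
    """Return the text of the first <tag> descendant of element, or '' if missing or empty."""
    nodes = element.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.data
    return ""


def item_titles_and_links(xmldoc):
    """Collect one (title, link) pair per <item>, searching inside each item only."""
    return [
        (first_text(item, "title"), first_text(item, "link"))
        for item in xmldoc.getElementsByTagName("item")
    ]


pairs = item_titles_and_links(minidom.parseString(SAMPLE_RSS))
for title, link in pairs:
    print(title, link)
```

Because the search starts from each item node, the channel's own title and link are not picked up. For a live feed in Python 3, the response object from urllib.request.urlopen(url) could probably be passed to minidom.parse in place of the sample string.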

Answer 1:


The error is here:

for item_node in xmldoc.documentElement.childNodes:
    if (item_node.nodeName == "item"):

There is no item element directly under the root, just a channel; the item elements are nested inside that channel. I found this out by printing every nodeName value in the loop.
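
The diagnostic described here can be reproduced offline. The following is a minimal sketch in Python 3 syntax (the question's code is Python 2), run against a made-up RSS 2.0 sample rather than the actual feed. SAMPLE_RSS, element_children and channel_items are hypothetical names used only for this illustration.

```python
from xml.dom import minidom, Node

# Hypothetical RSS 2.0 sample used for illustration; the real feed is not fetched.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def element_children(node):
    """Return the names of a node's element children, skipping whitespace text nodes."""
    return [c.nodeName for c in node.childNodes if c.nodeType == Node.ELEMENT_NODE]


def channel_items(xmldoc):
    """Descend root -> channel -> item instead of looking for item under the root."""
    items = []
    for child in xmldoc.documentElement.childNodes:
        if child.nodeName == "channel":
            items.extend(c for c in child.childNodes if c.nodeName == "item")
    return items


xmldoc = minidom.parseString(SAMPLE_RSS)
root_children = element_children(xmldoc.documentElement)
items = channel_items(xmldoc)
print(root_children)  # element names directly under <rss>
print(len(items))     # number of <item> elements found inside <channel>
```

In this sample, the only element under the root is the channel, which is why the original loop never reaches an item. Adding one level of descent is enough to find them.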



Source: https://stackoverflow.com/questions/9190934/python-xml-parsing-not-working-for-some-sites
