Why is BeautifulSoup unable to correctly read/parse this RSS (XML) document?


Wow, great question. This strikes me as a bug in BeautifulSoup. The reason you can't access the link using soup.find('item').link is that when you first load the document into BeautifulSoup, it does something odd to the HTML:

>>> from bs4 import BeautifulSoup as BS
>>> BS(html)
<html><body><rss version="2.0">
<channel>
<title>Hacker News</title><link/>http://news.ycombinator.com/<description>Links for the intellectually curious, ranked by readers.</description>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'Notch'</title>
<link/>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
<description>Comments]]&gt;</description>
</item>
<item>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
<link/>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
<description>Comments]]&gt;</description>
</item>
...
</channel>
</rss></body></html>

Look carefully: it has actually changed each <link> tag to <link/> and dropped the closing </link> tag. In HTML, <link> is a void element, which is presumably why the HTML parser closes it immediately. Without fixing the problem in the BeautifulSoup.BeautifulSoup class initialization, you're not going to be able to use it for now.
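
You can see the behavior in isolation. Here is a minimal sketch using the built-in 'html.parser' (lxml's HTML parser, the default when installed, does the same); example.com is an illustrative URL:

>>> from bs4 import BeautifulSoup as BS
>>> BS('<link>http://example.com/</link>', 'html.parser')
<link/>http://example.com/

The parser emits an empty <link/> immediately, leaves the URL as a loose text node beside it, and ignores the stray </link>.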

Update:

I think your best (albeit hacky) bet for now is to use the following for the link:

>>> soup.find('item').link.next_sibling
u'http://news.ycombinator.com/'
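
Generalized to every item, the workaround would look something like this (a sketch based on the mangled parse above; soup is the already-parsed document, and the strip() is there because the loose text node keeps its trailing whitespace):

for item in soup.find_all('item'):
    title = item.title.string
    # the URL survives as a text node right after the empty <link/>
    link = item.link.next_sibling.strip()
    print(title)
    print(link)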

Actually, the problem seems to be related to the parser you are using. By default, an HTML one is used. Try using soup = BeautifulSoup(request.text, 'xml') after installing the lxml module.

It will then use an XML parser instead of an HTML one, and everything should work.

See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for more info.
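
Putting it together, something like this should work (a sketch; the feed URL is illustrative, and lxml must be installed for the 'xml' parser to be available):

import requests
from bs4 import BeautifulSoup

request = requests.get('https://news.ycombinator.com/rss')
# 'xml' forces lxml's XML parser, which keeps <link>...</link> intact
soup = BeautifulSoup(request.text, 'xml')

for item in soup.find_all('item'):
    print(item.title.string)
    print(item.link.string)      # a normal tag with text again
    print(item.comments.string)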

Tony Wang

@Yan Hudon is right. I solved the problem with soup = BeautifulSoup(request.text, 'xml')

I don't think there's a bug in BeautifulSoup here.

I installed a clean copy of BS4 4.1.3 on Apple's stock Python 2.7.2 from OS X 10.8.2, and everything worked as expected. It doesn't mis-parse the <link> as <link/>, and therefore it doesn't have the problem with item.find('link').

I also tried using the stock xml.etree.ElementTree and xml.etree.cElementTree in 2.7.2, and xml.etree.ElementTree in python.org 3.3.0, to parse the same thing, and it again worked fine. Here's the code:

import xml.etree.ElementTree as ET

# x holds the raw RSS document as a string
rss = ET.fromstring(x)
for channel in rss.findall('channel'):
    for item in channel.findall('item'):
        title = item.find('title').text
        link = item.find('link').text
        comments = item.find('comments').text
        print(title)
        print(link)
        print(comments)

I then installed lxml 3.0.2 (I believe BS4 uses lxml if it's available), using Apple's built-in /usr/lib/libxml2.2.dylib (which, according to xml2-config --version, is 2.7.8), ran the same tests with its etree and with BS, and again everything worked.
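
For reference, the lxml version of the same test looks like this (a sketch; as above, x holds the raw feed string):

from lxml import etree

# lxml prefers bytes when the document declares its own encoding
rss = etree.fromstring(x.encode('utf-8'))
for item in rss.iterfind('channel/item'):
    print(item.findtext('title'))
    print(item.findtext('link'))
    print(item.findtext('comments'))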

In addition to screwing up the <link>, jdotjdot's output shows that BS4 is screwing up the <description> in an odd way. The original is this:

<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>

His output is:

<description>Comments]]&gt;</description>

My output from running his exact same code is:

<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>

So, it seems like there's a much bigger problem going on here. The odd thing is that it's happening to two different people, when it isn't happening on a clean install of the latest version of anything.

That implies either that it's a bug that's been fixed and I just have a newer version of whatever had the bug, or it's something weird about the way they both installed something.

BS4 itself can be ruled out, since at least Treebranch has 4.1.3 just like me. Although, without knowing how he installed it, it could be a problem with the installation.

Python and its built-in etree can be ruled out, since at least Treebranch has the same stock Apple 2.7.2 from OS X 10.8 as me.

It could very well be a bug in lxml or the underlying libxml, or in the way they were installed. I know jdotjdot has lxml 2.3.6, so this could be a bug that was fixed somewhere between 2.3.6 and 3.0.2. In fact, according to the lxml website and the change notes, there is no 2.3.6 release after 2.3.5, so whatever he has may be some kind of buggy build from an early, canceled branch. I don't know his libxml version, how either was installed, or what platform he's on, so it's hard to guess, but at least this is something that can be investigated.
