error with parse function in lxml

半城伤御伤魂 posted on 2019-11-29 01:45:27

lxml.html.parse does not fetch URLs.

Here's how to do it with urllib2:

>>> from urllib2 import urlopen
>>> from lxml.html import parse
>>> page = urlopen('http://www.google.com')
>>> p = parse(page)
>>> p.getroot()
<Element html at 1304050>

Update
Steven is right. lxml.etree.parse should accept and load URLs. I missed that. I've tried deleting this answer, but I'm not allowed.

I retract my statement about it not fetching URLs.

According to the API docs, it should work: http://lxml.de/api/lxml.html-module.html#parse

This seems to be a bug in lxml 2.2.2. I just tested on Windows with Python 2.6 and 2.7, and it does work with lxml 2.3.0.

So: upgrade your lxml and you'll be fine.

I don't know exactly which versions of lxml are affected, but I believe the problem was not so much with lxml itself as with the version of libxml2 used to build the Windows binary (certain versions of libxml2 had a problem with HTTP on Windows).
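
If you want to check which versions you actually have before upgrading, lxml exposes them as tuples on lxml.etree. This is only a small sketch; the helper name describe_versions is mine, not part of lxml:

```python
from lxml import etree


def describe_versions():
    """Return the lxml and libxml2 version numbers as dotted strings."""
    def dotted(parts):
        return ".".join(str(part) for part in parts)

    return {
        # Version of the lxml package itself.
        "lxml": dotted(etree.LXML_VERSION),
        # libxml2 version loaded at runtime.
        "libxml2 (runtime)": dotted(etree.LIBXML_VERSION),
        # libxml2 version lxml was compiled against.
        "libxml2 (compiled)": dotted(etree.LIBXML_COMPILED_VERSION),
    }


for name, version in describe_versions().items():
    print(name, version)
```

If the lxml line shows 2.2.2, that matches the version where I saw the failure.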

bmaupin

Since line breaks are not allowed in comments, here's my implementation of MattH's answer:

from urllib2 import urlopen
from lxml.html import parse

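site_url = 'http://www.google.com'

try:
    # Let lxml fetch the URL itself first.
    page = parse(site_url).getroot()
except IOError:
    # Fall back to fetching with urllib2 and parsing the response.
    page = parse(urlopen(site_url)).getroot()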
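
urllib2 only exists on Python 2; on Python 3, urlopen lives in urllib.request. Here is a rough sketch of the same fallback for Python 3. The function name load_root and the opener parameter are my own additions for illustration (the parameter just makes it possible to swap in a different fetcher), so treat this as one possible way to write it rather than anything lxml prescribes:

```python
from urllib.request import urlopen

from lxml.html import parse


def load_root(source, opener=urlopen):
    """Parse source with lxml; if lxml cannot read it, fetch it via opener."""
    try:
        # Let lxml open the file name or URL itself first.
        return parse(source).getroot()
    except IOError:
        # lxml could not read the source, so fetch it ourselves and
        # hand the file-like response to the parser instead.
        with opener(source) as response:
            return parse(response).getroot()
```

Usage would be something like page = load_root('http://www.google.com'). On Python 3, IOError is just an alias of OSError, so the except clause catches the same error as before.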