Error with parse function in lxml

Anonymous (unverified), submitted on 2019-12-03 01:31:01

Question:

I have installed lxml 2.2.2 on Windows (I'm using Python 2.6.5). I tried this simple command:

from lxml.html import parse
p = parse('http://www.google.com').getroot()

But I am getting the following error:

Traceback (most recent call last):
  File "", line 1, in
    p=parse('http://www.google.com').getroot()
  File "C:\Python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:49590)
  File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71205)
  File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71488)
  File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70583)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67736)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63820)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64741)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64056)
IOError: Error reading file 'http://www.google.com': failed to load external entity "http://www.google.com"

I am clueless as to what to do next, as I am a newbie to Python. Please guide me to solve this error. Thanks in advance!! :)

Answer 1:

lxml.html.parse does not fetch URLs.

Here's how to do it with urllib2:

>>> from urllib2 import urlopen
>>> from lxml.html import parse
>>> page = urlopen('http://www.google.com')
>>> p = parse(page)
>>> p.getroot()

Update
Steven is right. lxml.etree.parse should accept and load URLs. I missed that. I've tried deleting this answer, but I'm not allowed.

I retract my statement about it not fetching URLs.



Answer 2:

According to the API docs, it should work: http://lxml.de/api/lxml.html-module.html#parse

This seems to be a bug in lxml 2.2.2. I just tested on Windows with Python 2.6 and 2.7, and it does work with lxml 2.3.0.

So: upgrade your lxml and you'll be fine.

I don't know exactly which versions of lxml are affected, but I believe the problem was not so much with lxml itself as with the version of libxml2 used to build the Windows binary (certain versions of libxml2 had a problem with HTTP on Windows).
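
To see which lxml release and which libxml2 version an installation is actually using, something like the following could help. It is a minimal sketch written for Python 3; the function name `lxml_versions` is made up for this example, while `LXML_VERSION` and `LIBXML_VERSION` are the version tuples exposed by `lxml.etree`.

```python
def lxml_versions():
    """Return (lxml_version, libxml2_version) as tuples, or None if lxml is missing.

    The function name is hypothetical; it only wraps attributes of lxml.etree.
    """
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed in this interpreter.
        return None
    # LXML_VERSION: the lxml release itself.
    # LIBXML_VERSION: the libxml2 library loaded at runtime.
    return etree.LXML_VERSION, etree.LIBXML_VERSION


versions = lxml_versions()
if versions is None:
    print("lxml is not installed")
else:
    lxml_version, libxml2_version = versions
    print("lxml:", ".".join(map(str, lxml_version)))
    print("libxml2:", ".".join(map(str, libxml2_version)))
```

If the reported lxml version is 2.2.2, upgrading as suggested above is the first thing to try.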



Answer 3:

Since line breaks are not allowed in comments, here's my implementation of MattH's answer:

from urllib2 import urlopen
from lxml.html import parse

site_url = 'http://www.google.com'

try:
    page = parse(site_url).getroot()
except IOError:
    page = parse(urlopen(site_url)).getroot()
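
For readers on Python 3, where `urllib2` no longer exists and `urlopen` lives in `urllib.request`, the same try-then-fall-back idea might be sketched as below. The helper name `parse_with_fallback` and its `parse_fn` and `open_fn` parameters are hypothetical; they are introduced here only so the fallback logic can be tried without network access, and are not part of lxml or of the answers above.

```python
from urllib.request import urlopen  # Python 3 location of urllib2.urlopen


def parse_with_fallback(url, parse_fn, open_fn=urlopen):
    """Try parse_fn(url) first; on IOError, fetch the URL and parse the response.

    parse_fn is assumed to behave like lxml.html.parse: it accepts either
    a URL string or a file-like object. Both parameter names are hypothetical.
    """
    try:
        # Let the parser load the URL by itself.
        return parse_fn(url)
    except IOError:
        # The parser could not load the URL: fetch it ourselves and
        # hand the file-like response object to the parser instead.
        return parse_fn(open_fn(url))


# Intended use with lxml (not executed here, as it needs lxml and network access):
#     from lxml.html import parse
#     page = parse_with_fallback('http://www.google.com', parse).getroot()
```

Passing the parser and opener in as arguments is only a convenience for testing; calling `parse` and `urlopen` directly, as in the original snippet, works just as well.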

