I have installed lxml 2.2.2 on Windows (I'm using Python 2.6.5). I tried this simple command:
from lxml.html import parse
p = parse('http://ww
According to the API docs it should work: http://lxml.de/api/lxml.html-module.html#parse
This seems to be a bug in lxml 2.2.2. I just tested on Windows with Python 2.6 and 2.7, and it does work with 2.3.0.
So: upgrade your lxml and you'll be fine.
I don't know exactly which versions of lxml are affected, but I believe the problem was not so much in lxml itself as in the version of libxml2 used to build the Windows binary (certain versions of libxml2 had a problem with HTTP on Windows).
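If you want to see which lxml and libxml2 versions your install was built with, you can inspect the version constants exposed by lxml.etree (a minimal sketch; the printed tuples will of course depend on your install):

from lxml import etree

# Version of lxml itself, e.g. (2, 2, 2, 0)
print etree.LXML_VERSION
# libxml2 version lxml is running against, and the one it was compiled against
print etree.LIBXML_VERSION
print etree.LIBXML_COMPILED_VERSION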
lxml.html.parse does not fetch URLs.
Here's how to do it with urllib2:
>>> from urllib2 import urlopen
>>> from lxml.html import parse
>>> page = urlopen('http://www.google.com')
>>> p = parse(page)
>>> p.getroot()
<Element html at 1304050>
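From there you can query the tree as usual; for example (an illustrative one-liner, the actual title text depends on the page):

>>> p.getroot().findtext('.//title')
'Google'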
Update
Steven is right: lxml.etree.parse should accept and load URLs. I missed that. I've tried deleting this answer, but I'm not allowed to.
I retract my statement about it not fetching URLs.
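In other words, on a working lxml build the one-liner from the question should work directly (a minimal sketch, using google.com as a stand-in URL):

>>> from lxml.html import parse
>>> p = parse('http://www.google.com')
>>> p.getroot().tag
'html'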
Since line breaks are not allowed in comments, here's my implementation of MattH's answer:
from urllib2 import urlopen
from lxml.html import parse

site_url = 'http://www.google.com'
try:
    # Let lxml fetch and parse the URL directly
    page = parse(site_url).getroot()
except IOError:
    # Some lxml/libxml2 builds on Windows can't fetch over HTTP;
    # fall back to fetching with urllib2 and parsing the file object
    page = parse(urlopen(site_url)).getroot()
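Either way you end up with the root element, which you can then query normally, e.g. (assuming the page has a <title> element):

print page.findtext('.//title')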