Question
I am using lxml to parse HTML files from URLs.
For example:
import lxml.html

link = 'https://abc.com/def'
htmltree = lxml.html.parse(link)
My code works well for most cases, the ones with http://. However, I found that for every https:// URL, lxml simply raises an IOError. Does anyone know the reason? And possibly, how to correct this problem?
BTW, I want to stick with lxml rather than switch to BeautifulSoup, given that I've already got a working programme.
Answer 1:
I don't know what's happening, but I get the same errors. HTTPS is probably not supported. You can easily work around this with urllib2, though:
from lxml import html
from urllib2 import urlopen
html.parse(urlopen('https://duckduckgo.com'))
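(A minimal sketch of the same workaround for Python 3, where urllib2 has been replaced by urllib.request; lxml's parse() accepts the file-like object returned by urlopen():)

from lxml import html
from urllib.request import urlopen

# Fetch over HTTPS with the standard library, then hand the
# file-like response object to lxml for parsing.
htmltree = html.parse(urlopen('https://duckduckgo.com'))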
Answer 2:
From the lxml documentation:
lxml can parse from a local file, an HTTP URL or an FTP URL
I don't see HTTPS in that sentence anywhere, so I assume it is not supported.
An easy workaround would be to retrieve the file using some other library that does support HTTPS, such as urllib2, and pass the retrieved document as a string to lxml.
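A minimal sketch of that workaround, assuming Python 2 (urllib2) and reusing the hypothetical URL from the question:

from lxml import html
from urllib2 import urlopen

link = 'https://abc.com/def'        # hypothetical HTTPS URL from the question
content = urlopen(link).read()      # fetch the page over HTTPS
htmltree = html.fromstring(content) # parse the retrieved document as a string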
Source: https://stackoverflow.com/questions/7882673/what-is-the-deal-about-https-when-using-lxml