Question
I am using lxml to parse HTML files from URLs.
For example:
import lxml.html

link = 'https://abc.com/def'
htmltree = lxml.html.parse(link)
My code works well for most cases, the ones with http://. However, I found that for every https:// URL, lxml simply raises an IOError. Does anyone know the reason? And possibly, how to correct this problem?
BTW, I want to stick with lxml rather than switch to BeautifulSoup, given that I've already got a working programme.
Answer 1:
I don't know what's happening, but I get the same errors. HTTPS is probably not supported. You can easily work around this with urllib2, though:
from lxml import html
from urllib2 import urlopen
html.parse(urlopen('https://duckduckgo.com'))
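(A minimal sketch of the same workaround for Python 3, where urllib2 has been replaced by urllib.request; lxml's parse() accepts the file-like object returned by urlopen():)

from lxml import html
from urllib.request import urlopen

# Fetch over HTTPS with the standard library, then hand the
# file-like response object to lxml for parsing.
htmltree = html.parse(urlopen('https://duckduckgo.com'))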
Answer 2:
From the lxml documentation:
lxml can parse from a local file, an HTTP URL or an FTP URL
I don't see HTTPS in that sentence anywhere, so I assume it is not supported.
An easy workaround would be to retrieve the file using some other library that does support HTTPS, such as urllib2, and pass the retrieved document as a string to lxml.
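A minimal sketch of that workaround, assuming Python 2 (urllib2) and reusing the hypothetical URL from the question:

from lxml import html
from urllib2 import urlopen

link = 'https://abc.com/def'        # hypothetical HTTPS URL from the question
content = urlopen(link).read()      # fetch the page over HTTPS
htmltree = html.fromstring(content) # parse the retrieved document as a string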
Source: https://stackoverflow.com/questions/7882673/what-is-the-deal-about-https-when-using-lxml