Extracting and parsing HTML from a secure website with Python?

Submitted by 。_饼干妹妹 on 2019-12-10 09:50:49

Question


Let's dive into this, shall we?

OK, I need to write a script (I don't care what language; I'd prefer something like Python or JavaScript, but I will take the time to learn whatever works). The script will access multiple URLs, extract the text from each site, and store it in a folder on my PC. (From there I am manipulating the data with Python, which I know how to do.)

EDIT: Currently I am using Python's NLTK module. Here is a simple version of my code:

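from urllib.request import urlopen  # Python 3; in Python 2 this was urllib2.urlopen
import nltk

url = "<URL HERE>"
html = urlopen(url).read()
# clean_html strips the markup; it exists in NLTK 2.x but was removed in NLTK 3
raw = nltk.clean_html(html)
print(raw)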

This code works fine for both HTTP and HTTPS, but not for sites where authentication is required.

Is there a Python module which deals with secure authentication?

Thanks in advance for the help! And to the mods who will view this as a bad question, please just give me ways to make it better. I need ideas from people, not Google.


Answer 1:


Mechanize is one option; the other is to use urllib2 directly.
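
As a rough illustration of the urllib2 route, here is a minimal sketch using urllib.request, the Python 3 successor to urllib2. It assumes the site uses HTTP Basic authentication; a site with a form-based login would need cookies and a form submission instead, which is where Mechanize fits better. The function names build_auth_opener and fetch_html, and the URL and credentials in the usage note, are hypothetical placeholders rather than anything from the question.

```python
import urllib.request


def build_auth_opener(top_level_url, username, password):
    """Build an opener that answers HTTP Basic auth challenges.

    Assumption: the target site uses HTTP Basic authentication.
    """
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # A realm of None means "use these credentials for any realm
    # under top_level_url".
    password_mgr.add_password(None, top_level_url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def fetch_html(opener, url):
    """Download a page through the opener and return it as text."""
    with opener.open(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Something like `opener = build_auth_opener("https://example.com/", "user", "password")` followed by `html = fetch_html(opener, "https://example.com/private/page")` would then return the page source, which can be passed to whatever HTML-to-text step is already in use.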



Source: https://stackoverflow.com/questions/18134834/extracting-and-parsing-html-from-a-secure-website-with-python
