What is the best way to programmatically log into a web site in order to screen scrape? (Preferably in Python)

Submitted by 本秂侑毒 on 2019-12-03 21:15:45

You can try Mechanize (http://wwwsearch.sourceforge.net/mechanize/) for programmatic web-browsing, and definitely use Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) for the scraping.

Most of us use urllib2 (urllib.request in Python 3) to get the page; it can handle various forms of authentication and cookie collection. Then use Beautiful Soup to parse the results.
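As a rough sketch of the cookie handling mentioned above, here is how it might look with the Python 3 standard library, where urllib2 became urllib.request and cookielib became http.cookiejar. The URL and the form field names "username" and "password" are assumptions for illustration; the real names depend on the site's login form.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that keeps cookies between requests, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build (but do not send) a form-encoded POST request for a login form."""
    # The field names here are hypothetical; check the real form for the actual ones.
    form = {"username": username, "password": password}
    data = urllib.parse.urlencode(form).encode("utf-8")
    # Supplying data makes urllib send the request as a POST.
    return urllib.request.Request(login_url, data=data)


# Typical use (not executed here because it needs a real site):
#   opener, jar = make_session()
#   opener.open(build_login_request("https://example.com/login", "alice", "secret"))
#   page = opener.open("https://example.com/members").read()
```

Any session cookie set by the login response stays in the jar, so later requests made through the same opener are sent as the logged-in user. The returned bytes can then be handed to Beautiful Soup for parsing.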

I once wrote a Python script to automatically log into vBulletin forums. The difficult part was knowing how to correctly form the login request, and that is something a library won't help you with. I found Live HTTP Headers, an add-on for Firefox, to be pretty helpful for seeing what is sent between the client and server during the login process.
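One way to get a head start on forming the request is to read the login page itself and collect every input field, including hidden ones, before filling in the credentials. The sketch below does that with the standard library's html.parser. The sample form and its field names ("user", "pass", "token") are made up for illustration and are not vBulletin's actual form; comparing the result against what a header-inspection tool shows is still worthwhile.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from every <input> tag in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are treated as empty strings.
            self.fields[name] = attributes.get("value") or ""


def extract_form_fields(html):
    """Return a dict of all named input fields found in the HTML."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical login page used only to demonstrate the parser.
SAMPLE_LOGIN_PAGE = """
<form action="/login.php" method="post">
  <input type="text" name="user">
  <input type="password" name="pass">
  <input type="hidden" name="token" value="abc123">
</form>
"""

fields = extract_form_fields(SAMPLE_LOGIN_PAGE)
# Fill in the credentials; hidden fields such as the token are kept as found.
fields.update({"user": "alice", "pass": "secret"})
```

The resulting dict can be URL-encoded and posted as the login request body. This simple version collects inputs from the whole page, so a page with several forms would need the parser narrowed to the right one.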

I also agree with everyone else that Beautiful Soup is pretty awesome.

I recommend using twill; it makes the login procedure a snap. Then use Beautiful Soup etc. as described above. I've never tried Mechanize, but it looks pretty good.

Just for screen scraping, you can use a combination of urllib + pyquery: https://pythonhosted.org/pyquery/
