Where shall I start in making a scraper or a bot using python? [closed]

Submitted by 隐身守侯 on 2019-11-30 04:18:00

If you’re trying to access websites that make heavy use of JavaScript, you might, overall, find Selenium easier.

Selenium consists of a server that drives real web browsers on the machine it runs on, plus client libraries (including Python bindings) that let you control those browsers and inspect the pages they load.

It takes more up-front work to configure (and figure out) the server and client library, and to make sure a working browser is installed on the system. But if the website does a lot of its work in JavaScript, your actual scraping code could be a lot less hairy.

Screen scraping often relies heavily on regular expressions to pull out exactly the data you want. Before you start, decide what data you want to analyze and how you want to store it.
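As a rough illustration of the regex-then-store idea, here is a minimal sketch using only the standard library. The HTML snippet, the class names in it, and the helper names `extract_items` and `to_csv` are all made up for the example; a real page needs its own pattern, and regular expressions tend to break when the markup changes.

```python
import csv
import io
import re

# Hypothetical HTML standing in for a downloaded page (not from any real site).
SAMPLE_HTML = """
<ul>
  <li class="item"><span class="name">Widget</span> <span class="price">$4.50</span></li>
  <li class="item"><span class="name">Gadget</span> <span class="price">$12.00</span></li>
</ul>
"""

# One pattern per record: capture the item name and the numeric part of the price.
ITEM_RE = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="price">\$(?P<price>\d+\.\d{2})</span>'
)


def extract_items(html):
    """Return a list of (name, price) tuples found in the HTML text."""
    return [(m.group("name"), float(m.group("price"))) for m in ITEM_RE.finditer(html)]


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "price"])
    writer.writerows(rows)
    return buffer.getvalue()


items = extract_items(SAMPLE_HTML)
csv_text = to_csv(items)
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice you would probably write to a file or a database instead, depending on how you plan to analyze the data.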

To get the pages, you'll need a library such as urllib (or urllib2 on Python 2). To pull the data out of them, you can use regular expressions (re), or let BeautifulSoup do the dirty work for you (http://www.crummy.com/software/BeautifulSoup/).
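A minimal sketch of the fetch-and-parse step on Python 3, where urllib2's functionality lives in `urllib.request`. To keep it dependency-free, it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, which offers a friendlier API for the same job. The names `fetch`, `extract_links`, `LinkCollector`, and the User-Agent string are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def fetch(url, timeout=10):
    """Download a page and return its body decoded to text."""
    # The User-Agent value is a placeholder; identify your own bot here.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html, base_url):
    """Parse HTML text and return the links it contains."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Usage would look something like `extract_links(fetch(url), url)` for a URL of your choice. Keeping the download and the parsing in separate functions means the parsing can be tried out on saved HTML without touching the network.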

If you want to build a pure bot that does what the search engines do, it also has to be smart enough not to keep pinging the same domain continuously, which can amount to a denial-of-service (DoS) attack.
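One simple way to avoid hammering a single domain is to remember when each domain was last contacted and sleep until a minimum delay has passed. The sketch below is one possible approach, not a complete politeness policy; the class name `DomainThrottle` and the one-second default delay are assumptions chosen for illustration.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        # clock and sleep are injectable so the class can be tested without real waiting.
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # domain -> time of the most recent request

    def wait(self, url):
        """Block until the URL's domain may be contacted; return seconds slept."""
        domain = urlparse(url).netloc
        last = self._last_request.get(domain)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the time after any sleep, just before the caller sends its request.
        self._last_request[domain] = self._clock()
        return waited
```

A crawler would call `throttle.wait(url)` immediately before each download. Requests to different domains are not delayed by one another, so the bot stays busy while still spacing out hits to any single site.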
