I have had thoughts of trying to write a simple crawler that would crawl our NPO's websites and content and produce a list of its findings.
Does anybody have any advice on how to go about this?
Crawlers are simple in concept.
You fetch a root page via an HTTP GET, parse it to find URLs, and put them on a queue unless they've been parsed already (so you need a global record of the pages you have already parsed).
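Here's a rough single-threaded sketch of that step in Python, using only the standard library. The start URL and the crawl cap are placeholders for illustration:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

seen = set()                              # global record of pages already parsed
queue = deque(["https://example.org/"])   # root page (placeholder URL)

while queue and len(seen) < 100:          # arbitrary cap so the sketch terminates
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    with urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    for link in parser.links:
        if link not in seen:
            queue.append(link)
    print(url, "->", len(parser.links), "links")
```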
You can use the Content-Type header to find out what kind of content you've fetched, and limit your crawler to parsing only the HTML types.
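With urllib, the response headers expose the media type directly, so the check is a couple of lines (sketch, using the same placeholder URL as above):

```python
from urllib.request import urlopen

with urlopen("https://example.org/") as resp:
    # resp.headers is an email.message-style object; get_content_type()
    # returns just the media type, e.g. "text/html", without the charset part.
    content_type = resp.headers.get_content_type()
    if content_type in ("text/html", "application/xhtml+xml"):
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    else:
        html = None   # skip PDFs, images, and anything else we can't parse
```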
You can strip out the HTML tags to get the plain text, which you can run text analysis on (to extract keywords and so on, the meat of the page). You could even do that on the alt/title attributes of images if you wanted to get that advanced.
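Stripping the tags can reuse the same HTMLParser approach. Here's a sketch that accumulates the visible text plus alt/title attributes from images; the analysis step itself is left out:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulate visible text, skipping script/style, plus img alt/title."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skipping += 1
        elif tag == "img":
            for name, value in attrs:
                if name in ("alt", "title") and value:
                    self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skipping:
            self._skipping -= 1

    def handle_data(self, data):
        if not self._skipping and data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed("<p>Hello <img src='x.png' alt='a logo'> world</p>")
print(" ".join(extractor.chunks))   # -> "Hello a logo world"
```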
And in the background you can have a pool of threads pulling URLs from the queue and doing the same thing. You want to limit the number of threads, of course.
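A minimal sketch of that threaded version, assuming a fetch_and_parse() function (a placeholder here) that fetches a page and returns the links it found. queue.Queue is thread-safe, and the worker count is capped:

```python
import queue
import threading

NUM_WORKERS = 4                 # limit the number of threads

url_queue = queue.Queue()
seen = set()
seen_lock = threading.Lock()    # the visited set is shared, so guard it

def fetch_and_parse(url):
    """Placeholder: fetch the page and return the URLs found on it."""
    return []                   # swap in the real fetch/parse logic here

def worker():
    while True:
        url = url_queue.get()
        try:
            for link in fetch_and_parse(url):
                with seen_lock:
                    if link in seen:
                        continue
                    seen.add(link)
                url_queue.put(link)
        finally:
            url_queue.task_done()

url_queue.put("https://example.org/")
seen.add("https://example.org/")

threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

url_queue.join()                # returns once every queued URL has been processed
print("crawled", len(seen), "pages")
```

The daemon threads just die when the main thread finishes, and Queue.join() gives you a clean "crawl is done" signal without having to poll anything.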