How to write a crawler?

Backend · Unresolved · 10 answers · 1868 views
感情败类 2020-12-02 03:47

I have had thoughts of trying to write a simple crawler that might crawl and produce a list of its findings for our NPO's websites and content.

Does anybody have any thoughts or suggestions on how to approach this?

10 Answers
  •  盖世英雄少女心
    2020-12-02 04:34

    If your NPO's sites are relatively big or complex (with dynamic pages that effectively create a 'black hole', like a calendar with a 'next day' link), you'd be better off using a real web crawler, like Heritrix.

    If the sites total only a few pages, you can get away with just using curl or wget or your own script. Just remember that if they start to get big, or your script starts getting more complex, to switch to a real crawler, or at least look at its source to see what it is doing and why.

    Some issues (there are more):

    • Black holes (as described)
    • Retries (what if you get a 500?)
    • Redirects
    • Rate limiting (otherwise you can be a burden on the sites)
    • robots.txt handling
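    As a rough sketch of two of the points above (link extraction and robots.txt handling), here is a minimal Python example using only the standard library. The URLs, the robots.txt rules, and the `LinkExtractor` class are illustrative assumptions; a real crawler would fetch pages over the network (e.g. with `urllib.request`) and add the retry and rate-limiting logic described above:

    ```python
    import urllib.robotparser
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collect absolute URLs from <a href="..."> tags."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # Resolve relative links against the page URL
                        self.links.append(urljoin(self.base_url, value))

    # Hypothetical robots.txt rules, parsed from text so the demo needs no network
    rp = urllib.robotparser.RobotFileParser()
    rp.parse("User-agent: *\nDisallow: /calendar/".splitlines())

    # A made-up page: note the calendar 'next day' link, a classic black hole
    html = '<a href="/about">About</a> <a href="/calendar/2020-12-03">Next day</a>'
    extractor = LinkExtractor("https://example.org/")
    extractor.feed(html)

    # Only queue URLs that robots.txt allows us to fetch
    allowed = [u for u in extractor.links if rp.can_fetch("mybot", u)]
    print(allowed)
    ```

    The same filter is where you would also deduplicate already-seen URLs and cap crawl depth, which is what keeps the calendar-style black hole from trapping the crawler.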
