Java Web Crawler Libraries


栀梦 2020-12-13 04:58

I wanted to build a Java-based web crawler for an experiment. I heard that writing a web crawler in Java is the way to go if this is your first time. However, I have two impo

12 answers

暖寄归人 (original poster)

2020-12-13 05:43
Have a look at these existing projects if you want to learn how it can be done:
- Apache Nutch
- crawler4j
- gecco
- Norconex HTTP Collector
- vidageek crawler
- webmagic
- Webmuncher
A typical crawler process is a loop of fetching, parsing, link extraction, and processing of the output (storing, indexing). The devil is in the details, though: how to be "polite" and respect robots.txt, meta tags, and rate limits, and how to handle redirects, URL canonicalization, infinite depth, retries, revisits, etc.
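
To make that loop concrete, here is a minimal sketch in plain Java. It is not how any of the projects listed above work internally. The class name `CrawlLoopSketch`, the `fetcher` function, and the regex-based link extraction are all assumptions made for illustration; a real crawler would use an HTTP client and an HTML parser, and would add robots.txt checks, rate limiting, redirects, and retries. The sketch only covers the fetch, parse, extract, process cycle, plus a simple URL canonicalization step and a depth limit. It also assumes the seed URL is already in canonical form.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrawlLoopSketch {
    // Naive href matcher (assumption for brevity); prefer a real HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Resolves a link against its page URL, normalizes it, and drops the fragment.
     *  Returns null for non-HTTP(S) or unparsable links. */
    static String canonicalize(String pageUrl, String link) {
        try {
            URI u = URI.create(pageUrl).resolve(link.trim()).normalize();
            String scheme = u.getScheme();
            if (!"http".equalsIgnoreCase(scheme) && !"https".equalsIgnoreCase(scheme)) {
                return null;
            }
            String s = u.toString();
            int hash = s.indexOf('#');
            return hash >= 0 ? s.substring(0, hash) : s;
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /** Breadth-first crawl from seed, following links at most maxDepth hops away.
     *  fetcher maps a URL to its HTML, or returns null if the fetch failed. */
    static List<String> crawl(String seed, int maxDepth, Function<String, String> fetcher) {
        List<String> processed = new ArrayList<>();
        Map<String, Integer> depthOf = new HashMap<>(); // doubles as the "seen" set
        Deque<String> frontier = new ArrayDeque<>();
        depthOf.put(seed, 0);
        frontier.add(seed);

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            String html = fetcher.apply(url);          // fetch
            if (html == null) {
                continue;                              // a real crawler might retry here
            }
            processed.add(url);                        // process (here: just record the URL)
            int depth = depthOf.get(url);
            if (depth >= maxDepth) {
                continue;                              // guard against infinite depth
            }
            Matcher m = HREF.matcher(html);            // parse and extract links
            while (m.find()) {
                String next = canonicalize(url, m.group(1));
                if (next != null && depthOf.putIfAbsent(next, depth + 1) == null) {
                    frontier.add(next);                // only enqueue URLs not seen before
                }
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // In-memory "web" so the sketch runs without network access.
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.test/", "<a href=\"/a.html\">A</a>");
        pages.put("http://example.test/a.html", "<a href=\"/\">home</a>");
        System.out.println(crawl("http://example.test/", 2, pages::get));
    }
}
```

Passing the fetcher in as a function keeps the loop testable offline; swapping in a real HTTP client is where the politeness concerns (robots.txt, delays between requests to the same host) would have to be handled.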

Flow diagram courtesy of Norconex HTTP Collector.