Java Web Crawler Libraries

栀梦 2020-12-13 04:58

I wanted to make a Java-based web crawler for an experiment. I heard that writing a web crawler in Java is the way to go if this is your first time. However, I have two impo

12 answers
  •  暖寄归人
    2020-12-13 05:43

    Have a look at these existing projects if you want to learn how it can be done:

    • Apache Nutch
    • crawler4j
    • gecco
    • Norconex HTTP Collector
    • vidageek crawler
    • webmagic
    • Webmuncher

    A typical crawler process is a loop consisting of fetching, parsing, link extraction, and processing of the output (storing, indexing). The devil is in the details, though: how to be "polite" and respect robots.txt, meta tags, and rate limits, and how to handle redirects, URL canonicalization, infinite depth, retries, revisits, etc.

    Flow diagram courtesy of Norconex HTTP Collector.
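
    The fetch, parse, extract-links, process loop described above can be sketched in plain Java roughly as follows. This is a minimal illustration under stated assumptions, not how any of the listed projects implement it: the class name CrawlLoopSketch, the crawl and canonicalize helpers, and the pluggable fetcher function are all hypothetical, and link extraction uses a naive regex where a real crawler would use an HTML parser. It covers only a few of the details mentioned (fixed delay, simple URL canonicalization, a depth limit, no revisits) and ignores robots.txt, meta tags, redirects, and retries.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical minimal crawl loop: fetch, parse, extract links, process. */
public class CrawlLoopSketch {

    // Naive matcher for double-quoted href attributes (illustration only).
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Simplistic canonicalization: resolve "." and ".." segments, drop the fragment. */
    static String canonicalize(URI uri) {
        String s = uri.normalize().toString();
        int hash = s.indexOf('#');
        return hash >= 0 ? s.substring(0, hash) : s;
    }

    /**
     * Breadth-first crawl from seed. The fetcher maps a URL to its HTML, or null on
     * failure; it is an assumed abstraction so the loop can run without a network.
     * Returns the URLs that were fetched and "processed", in visit order.
     */
    static List<String> crawl(String seed, int maxDepth, long delayMillis,
                              Function<String, String> fetcher) {
        Map<String, Integer> depthOf = new HashMap<>(); // doubles as the "seen" set
        Deque<String> frontier = new ArrayDeque<>();
        List<String> processed = new ArrayList<>();

        String start = canonicalize(URI.create(seed));
        depthOf.put(start, 0);
        frontier.add(start);

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            String html = fetcher.apply(url);            // 1. fetch
            if (html == null) continue;                  // failed fetch; retries omitted
            processed.add(url);                          // 4. process (store, index, ...)

            int depth = depthOf.get(url);
            if (depth < maxDepth) {                      // guard against infinite depth
                Matcher m = HREF.matcher(html);          // 2-3. parse and extract links
                while (m.find()) {
                    try {
                        String next = canonicalize(URI.create(url).resolve(m.group(1).trim()));
                        if (depthOf.putIfAbsent(next, depth + 1) == null) {
                            frontier.add(next);          // enqueue only unseen URLs
                        }
                    } catch (IllegalArgumentException e) {
                        // malformed link: skip it
                    }
                }
            }
            try {
                Thread.sleep(delayMillis);               // crude politeness delay
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // In-memory "site" with made-up URLs so the sketch runs offline.
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.test/index.html", "<a href=\"a.html\">A</a> <a href=\"/b.html#sec\">B</a>");
        pages.put("http://example.test/a.html", "<a href=\"index.html\">back</a>");
        pages.put("http://example.test/b.html", "");
        System.out.println(crawl("http://example.test/index.html", 2, 0, pages::get));
    }
}
```

    Passing the fetcher in as a function keeps the loop itself testable; for real pages it could be backed by an HTTP client such as java.net.http.HttpClient (Java 11+), with robots.txt checks and retry logic layered around it.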
