I wanted to make a Java-based web crawler for an experiment. I heard that making a web crawler in Java was the way to go if this is your first time. However, I have two impo
Have a look at these existing projects if you want to learn how it can be done:
A typical crawler runs a loop of fetching, parsing, extracting links, and processing the output (storing, indexing). The devil is in the details, though: being "polite" by respecting robots.txt and relevant meta tags, and handling redirects, rate limits, URL canonicalization, infinite crawl depth, retries, revisits, and so on.
[Flow diagram of this crawl cycle, courtesy of Norconex HTTP Collector.]
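To give a rough idea of that loop in Java, here is a minimal sketch using jsoup for fetching and parsing. The seed URL, page limit, and fixed one-second delay are placeholder assumptions, and it deliberately skips robots.txt, canonicalization, retries, and most of the other details mentioned above:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class TinyCrawler {

    public static void main(String[] args) throws InterruptedException {
        String seed = "https://example.com/";   // hypothetical seed URL
        int maxPages = 50;                      // crawl budget for the experiment

        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();  // crude revisit guard, no canonicalization
        frontier.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;                       // already fetched
            }
            try {
                // Fetch and parse the page (jsoup follows redirects by default).
                Document doc = Jsoup.connect(url)
                        .userAgent("tiny-crawler-experiment")
                        .timeout(10_000)
                        .get();

                // "Process" the output: here we just print the title; a real
                // crawler would store or index the content instead.
                System.out.println(url + " -> " + doc.title());

                // Extract links and enqueue the ones we have not seen yet.
                for (Element link : doc.select("a[href]")) {
                    String next = link.absUrl("href");
                    if (next.startsWith("http") && !visited.contains(next)) {
                        frontier.add(next);
                    }
                }
            } catch (Exception e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }

            // Naive politeness: a fixed delay between requests. A real crawler
            // would honor robots.txt and per-host rate limits instead.
            Thread.sleep(1000);
        }
    }
}
```

The existing projects linked above handle exactly the details this sketch skips, which is why they are worth studying before rolling your own.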