I have come across an interview question "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.
The crawler keeps a URL pool that contains all the URLs to be crawled. To avoid an infinite loop, the basic idea is to check whether each URL has already been seen before adding it to the pool.
However, this is not easy to implement once the system has scaled to a certain level. The naive approach is to keep all the URLs in a hash set and check each new URL against it, but this breaks down when there are too many URLs to fit in memory.
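For concreteness, here is a minimal Python sketch of that naive in-memory check. The `fetch_links` callback is a placeholder I'm assuming for "download a page and extract its links"; it is not part of the original answer.

```python
import collections

def crawl(seed_urls, fetch_links, max_pages=10_000):
    """Naive BFS crawl: an in-memory set deduplicates URLs before
    they enter the pool, which is what breaks cycles between pages."""
    visited = set(seed_urls)             # every URL ever added to the pool
    pool = collections.deque(seed_urls)  # URLs waiting to be crawled

    while pool and len(visited) <= max_pages:
        url = pool.popleft()
        for link in fetch_links(url):    # hypothetical page-download helper
            if link not in visited:      # existence check before adding
                visited.add(link)
                pool.append(link)
```

The `visited` set is exactly the part that stops fitting in memory at scale, which is what the following fixes address.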
There are a couple of ways to deal with this. For instance, instead of holding all the URLs in memory, keep them on disk. To save space, store a hash of each URL rather than the raw string. It's also worth noting that we should store the canonical form of a URL rather than the original one; if a URL has been shortened by a service like bit.ly, it's better to resolve it to the final URL first. To speed up the existence check, a cache layer can be built in front of the disk store; you can also treat this as a distributed cache system, which is a separate topic.
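A hedged sketch of the canonicalize-then-hash step is below. The normalization rules are illustrative only (real crawlers apply many more), and resolving a bit.ly-style redirect would require an actual HTTP request, which is omitted here.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Normalize host case, drop default ports and fragments, so
    trivially different spellings of the same URL map to one key."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if (parts.scheme == "http" and netloc.endswith(":80")) or \
       (parts.scheme == "https" and netloc.endswith(":443")):
        netloc = netloc.rsplit(":", 1)[0]
    return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))

def url_key(url):
    """Fixed-size key for the 'seen' store: a 20-byte SHA-1 digest of the
    canonical URL is much cheaper to index on disk than the raw string."""
    return hashlib.sha1(canonicalize(url).encode("utf-8")).digest()
```

The on-disk seen-set and the cache in front of it would then be keyed by `url_key(url)` rather than by the URL itself.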
The post Build a Web Crawler has a detailed analysis of this problem.