If you are going to build a crawler you will need to (Java specific):
- Learn how to use the java.net.URL and java.net.URLConnection classes, or use the HttpClient library (a basic fetch is sketched after this list)
- Understand http request/response headers
- Understand redirects (HTTP, HTML meta refresh and JavaScript)
- Understand content encodings (charsets)
- Use a good library for parsing badly formed HTML (e.g. CyberNeko, Jericho, Jsoup)
- Make concurrent HTTP requests to different hosts, but make sure you issue no more than one request to the same host every ~5 seconds (see the politeness sketch below)
- Persist the pages you have fetched, so you don't need to refetch them every day if they don't change that often (HBase can be useful)
- Extract links from the current page to decide what to crawl next (the Jsoup sketch below covers this)
- Obey robots.txt (a simplified check is sketched below)
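For the fetching side, here is a minimal sketch using the built-in HttpURLConnection that also touches the header, redirect and charset points above. The user-agent string and timeouts are placeholders, and the charset parsing is deliberately simplified:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SimpleFetcher {

    public static String fetch(String pageUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setInstanceFollowRedirects(true);   // follows HTTP 3xx, but not across http/https
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(30_000);
        // Placeholder user agent -- identify your crawler and give a contact address
        conn.setRequestProperty("User-Agent", "MyCrawler/0.1 (contact@example.com)");

        int status = conn.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IllegalStateException("HTTP " + status + " for " + pageUrl);
        }

        // The charset usually comes from the Content-Type response header,
        // e.g. "text/html; charset=UTF-8". This parse is deliberately simplified;
        // pages that omit the charset need sniffing (see the Tika note further down).
        Charset charset = StandardCharsets.UTF_8;
        String contentType = conn.getContentType();
        if (contentType != null && contentType.toLowerCase().contains("charset=")) {
            String name = contentType.substring(contentType.toLowerCase().indexOf("charset=") + 8)
                    .replace("\"", "").trim();
            try {
                charset = Charset.forName(name);
            } catch (Exception e) {
                // unknown charset label -- fall back to UTF-8
            }
        }

        // Read the body as raw bytes, then decode with the charset we worked out
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        return new String(buf.toByteArray(), charset);
    }
}
```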
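Parsing and link extraction with Jsoup might look roughly like this; Jsoup tolerates badly formed markup and can resolve relative URLs for you:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {

    public static List<String> extractLinks(String html, String baseUrl) {
        // baseUrl lets Jsoup resolve relative hrefs via the abs: prefix
        Document doc = Jsoup.parse(html, baseUrl);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            String target = a.attr("abs:href");   // absolute URL, or "" if it can't be resolved
            if (target.startsWith("http")) {
                links.add(target);
            }
        }
        return links;
    }
}
```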
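A rough sketch of the per-host politeness point, using plain java.util.concurrent; SimpleFetcher.fetch() refers to the fetch sketch above:

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoliteScheduler {

    private static final long HOST_DELAY_MS = 5_000;
    private final Map<String, Long> nextAllowedFetch = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(10);

    public void submit(String url) {
        pool.submit(() -> {
            try {
                String host = URI.create(url).getHost();
                waitForSlot(host);
                String html = SimpleFetcher.fetch(url);
                // ... hand the html to your parser / persistence layer here
            } catch (Exception e) {
                // log and move on; broken pages are normal in a crawl
            }
        });
    }

    // Blocks until this thread holds the next fetch slot for the host.
    private void waitForSlot(String host) throws InterruptedException {
        while (true) {
            long now = System.currentTimeMillis();
            Long next = nextAllowedFetch.putIfAbsent(host, now + HOST_DELAY_MS);
            if (next == null) {
                return;   // first request to this host, go straight away
            }
            if (now >= next) {
                // claim the slot only if no other thread beat us to it
                if (nextAllowedFetch.replace(host, next, now + HOST_DELAY_MS)) {
                    return;
                }
                continue;
            }
            Thread.sleep(Math.min(next - now, 500));
        }
    }
}
```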
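And a deliberately oversimplified robots.txt check, just to show the idea: it only looks at Disallow rules in the "User-agent: *" section. Real handling (Allow rules, wildcards, per-agent sections, crawl-delay) is more involved; a library such as crawler-commons covers the corner cases.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {

    public static boolean isAllowed(String url) {
        try {
            URI uri = URI.create(url);
            String robotsUrl = uri.getScheme() + "://" + uri.getHost() + "/robots.txt";
            String robots = SimpleFetcher.fetch(robotsUrl);   // fetch sketch above; cache this per host!

            // Collect Disallow paths from the "User-agent: *" section only
            List<String> disallowed = new ArrayList<>();
            boolean inStarSection = false;
            for (String line : robots.split("\n")) {
                String l = line.trim();
                if (l.toLowerCase().startsWith("user-agent:")) {
                    inStarSection = l.substring(11).trim().equals("*");
                } else if (inStarSection && l.toLowerCase().startsWith("disallow:")) {
                    String rule = l.substring(9).trim();
                    if (!rule.isEmpty()) {
                        disallowed.add(rule);
                    }
                }
            }

            String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
            for (String rule : disallowed) {
                if (path.startsWith(rule)) {
                    return false;
                }
            }
            return true;
        } catch (Exception e) {
            return true;   // no readable robots.txt -- most crawlers treat the site as crawlable
        }
    }
}
```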
There's a bunch of other stuff too.
It's not that difficult, but there are lots of fiddly edge cases, e.g. redirects and detecting encodings (check out Tika).
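For the encoding part, Tika bundles a charset detector (in tika-parsers, repackaged from ICU4J) that you can point raw bytes at when the Content-Type header doesn't say; the class names below are from Tika 1.x, so check your version:

```java
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public class EncodingSniffer {

    public static String sniffCharset(byte[] rawBytes) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(rawBytes);
        CharsetMatch match = detector.detect();   // best guess, with a confidence score
        return match == null ? "UTF-8" : match.getName();
    }
}
```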
For more basic requirements you could use wget.
Heritrix is another option, but it's yet another framework to learn.
Identifying About us pages can be done using various heuristics:
- inbound link text
- page title
- content on page
- URL
If you wanted to be more quantitative about it, you could use machine learning and a classifier (maybe naive Bayes).
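As a rough illustration, the heuristics above could be combined into a simple score per candidate page. The keywords and weights here are made up; the same signals would become features if you went the classifier route:

```java
import java.util.Locale;

public class AboutPageScorer {

    public static int score(String url, String pageTitle, String inboundAnchorText, String bodyText) {
        int score = 0;
        String u = url.toLowerCase(Locale.ROOT);
        if (u.contains("/about") || u.contains("about-us") || u.contains("company")) {
            score += 3;                       // the URL is usually the strongest signal
        }
        if (containsAboutKeyword(pageTitle)) {
            score += 2;
        }
        if (containsAboutKeyword(inboundAnchorText)) {
            score += 2;                       // e.g. the nav link pointing here said "About us"
        }
        if (containsAboutKeyword(bodyText)) {
            score += 1;
        }
        return score;                         // pick the highest-scoring page per site
    }

    private static boolean containsAboutKeyword(String text) {
        if (text == null) {
            return false;
        }
        String t = text.toLowerCase(Locale.ROOT);
        return t.contains("about us") || t.contains("about the company") || t.contains("who we are");
    }
}
```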
Saving the front page is obviously easier, but front-page redirects (sometimes to different domains, and often implemented with an HTML meta refresh tag or even JavaScript) are very common, so you need to handle them.
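Catching the meta-refresh case might look like this with Jsoup; JavaScript redirects still need separate handling (e.g. a regex over inline scripts, or a headless browser):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetaRefresh {

    // Returns the redirect target from a <meta http-equiv="refresh"> tag, or null if there isn't one.
    public static String metaRefreshTarget(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        Element meta = doc.select("meta[http-equiv=refresh]").first();
        if (meta == null) {
            return null;
        }
        String content = meta.attr("content");   // typically "0; url=https://example.com/home"
        int idx = content.toLowerCase().indexOf("url=");
        return idx == -1 ? null : content.substring(idx + 4).trim();
    }
}
```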