I have been thinking about writing a simple crawler that would crawl our NPO's websites and content and produce a list of its findings.
Does anybody have an
Multithreaded Web Crawler
If you want to crawl a large website, you should write a multithreaded crawler. Crawling has three steps: connecting, fetching, and writing the crawled information to files or a database. If you use a single thread, it spends most of its time waiting on the network or disk, so your CPU and network utilization will be poor.
A multithreaded web crawler needs two data structures: linksVisited (this should be implemented as a hash-based set or a trie) and linksToBeVisited (this is a queue).
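As a rough illustration of those two structures, here is a minimal sketch that uses the thread-safe collections from java.util.concurrent instead of explicit locking. The class name CrawlState and the methods offer and poll are made up for this example. Note that this variant records a URL as seen when it is queued rather than when it is fetched, which is an assumption of the sketch, not something stated above.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical holder for the two crawler data structures.
class CrawlState {
    // URLs already seen; a concurrent hash set gives fast membership checks.
    final Set<String> linksVisited = ConcurrentHashMap.newKeySet();
    // URLs waiting to be fetched, in FIFO order.
    final Queue<String> linksToBeVisited = new ConcurrentLinkedQueue<>();

    // Queues the URL only the first time it is seen.
    // Set.add returns false for a duplicate, so the check and the insert are one atomic step.
    boolean offer(String url) {
        if (linksVisited.add(url)) {
            linksToBeVisited.add(url);
            return true;
        }
        return false;
    }

    // Returns the next URL to fetch, or null if the queue is currently empty.
    String poll() {
        return linksToBeVisited.poll();
    }
}
```

Calling offer twice with the same URL queues it once; the second call returns false.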
A web crawler uses breadth-first search (BFS) to traverse the World Wide Web.
Algorithm of a basic web crawler:
Repeat steps 2 to 5 until the linksToBeVisited queue is empty.
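For reference, a single-threaded BFS crawl loop might look roughly like the sketch below. It is only an illustration of the usual queue-driven loop, not necessarily the exact steps meant above. To keep it runnable without network access, fetching a page and extracting its links is replaced by a lookup in an in-memory map (linkGraph); that map and the class name BfsCrawlSketch are assumptions of the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class BfsCrawlSketch {
    // linkGraph is a hypothetical stand-in for "fetch the page and extract its links".
    static List<String> crawl(String seed, Map<String, List<String>> linkGraph) {
        Set<String> linksVisited = new HashSet<>();
        Queue<String> linksToBeVisited = new ArrayDeque<>();
        List<String> order = new ArrayList<>();

        // Start from the seed URL.
        linksVisited.add(seed);
        linksToBeVisited.add(seed);

        while (!linksToBeVisited.isEmpty()) {
            // Take the oldest queued URL (FIFO order gives breadth-first traversal).
            String page = linksToBeVisited.remove();
            order.add(page);
            // Queue every outgoing link that has not been seen before.
            for (String link : linkGraph.getOrDefault(page, List.of())) {
                if (linksVisited.add(link)) {
                    linksToBeVisited.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"),
                "c", List.of("a"));
        System.out.println(crawl("a", graph)); // prints [a, b, c, d]
    }
}
```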
Here is a code snippet showing how to synchronize the threads:
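// Assumes linksVisited is a Set<String> and linksToBeVisited is a List<String> used as a FIFO queue.
public void add(String site) {
    synchronized (this) {
        // Skip sites that were already crawled or are already waiting in the queue.
        if (!linksVisited.contains(site) && !linksToBeVisited.contains(site)) {
            linksToBeVisited.add(site);
        }
    }
}

public String next() {
    synchronized (this) {
        // Check the size inside the lock: unless the collections are themselves
        // thread-safe, even reading them without synchronization is a data race.
        if (linksToBeVisited.size() > 0) {
            String s = linksToBeVisited.get(0);
            linksToBeVisited.remove(0);
            linksVisited.add(s);
            return s;
        }
        return null;
    }
}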
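
To show how worker threads could share add() and next(), here is a hedged sketch. One thing the snippet above leaves open is when a worker should stop: an empty queue does not mean the crawl is finished if another thread is still processing a page and may add links. The sketch handles that with a busy-worker counter and wait/notifyAll, which is my own addition. The names SynchronizedFrontier, ThreadedCrawlSketch, done and linkGraph are hypothetical, and the in-memory link map again stands in for real fetching.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SynchronizedFrontier {
    private final Set<String> linksVisited = new HashSet<>();
    private final Deque<String> linksToBeVisited = new ArrayDeque<>();
    private int busyWorkers = 0; // workers currently processing a page

    synchronized void add(String site) {
        if (!linksVisited.contains(site) && !linksToBeVisited.contains(site)) {
            linksToBeVisited.add(site);
            notifyAll(); // wake workers waiting for new links
        }
    }

    // Blocks while the queue is empty but some worker may still add links.
    // Returns null once the queue is empty and no worker is busy.
    synchronized String next() throws InterruptedException {
        while (linksToBeVisited.isEmpty()) {
            if (busyWorkers == 0) {
                return null;
            }
            wait();
        }
        String s = linksToBeVisited.removeFirst();
        linksVisited.add(s);
        busyWorkers++;
        return s;
    }

    // Must be called once after each page returned by next() has been processed.
    synchronized void done() {
        busyWorkers--;
        notifyAll();
    }

    synchronized Set<String> visited() {
        return new TreeSet<>(linksVisited); // sorted copy for stable output
    }
}

class ThreadedCrawlSketch {
    static Set<String> crawl(String seed, Map<String, List<String>> linkGraph, int threads) {
        SynchronizedFrontier frontier = new SynchronizedFrontier();
        frontier.add(seed);
        Runnable worker = () -> {
            try {
                String page;
                while ((page = frontier.next()) != null) {
                    try {
                        // Stand-in for fetching the page and extracting its links.
                        for (String link : linkGraph.getOrDefault(page, List.of())) {
                            frontier.add(link);
                        }
                    } finally {
                        frontier.done();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(worker);
            t.start();
            workers.add(t);
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return frontier.visited();
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"),
                "c", List.of("a"));
        System.out.println(crawl("a", graph, 4)); // prints [a, b, c, d]
    }
}
```

With several threads the order in which pages are processed is no longer strictly breadth-first, but each page is still handed to exactly one worker.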