I've been thinking about writing a simple crawler that would crawl our NPO's websites and content and produce a list of its findings.
Does anybody have an
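You'll be reinventing the wheel, to be sure. But here are the basics. You need a list of unvisited URLs (seeded with your starting pages), a list of visited URLs, and a set of rules for which URLs you want to follow.
Put these in persistent storage, so you can stop and start the crawler without losing state.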
The algorithm is:
while (list of unvisited URLs is not empty) {
    take URL from list
    remove it from the unvisited list and add it to the visited list
    fetch content
    record whatever it is you want to about the content
    if content is HTML {
        parse out URLs from links
        foreach URL {
            if it matches your rules
               and it's not already in either the visited or unvisited list
                add it to the unvisited list
        }
    }
}
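
If it helps, here is a rough Python sketch of that loop using only the standard library. Treat it as one possible shape rather than a finished crawler: the names (`crawl`, `allowed`, `LinkParser`, `fetch_page`), the JSON state file, and the "same host only" rule are all assumptions I picked for illustration, and it ignores things a real crawler should respect, such as robots.txt and rate limiting.

```python
import json
import os
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_page(url):
    """Return (content_type, body_text) for a URL."""
    with urlopen(url, timeout=10) as resp:
        content_type = resp.headers.get_content_type()
        charset = resp.headers.get_content_charset() or "utf-8"
        return content_type, resp.read().decode(charset, errors="replace")


def allowed(url, hosts):
    """Example rule (an assumption): only http(s) URLs on our own hosts."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.netloc in hosts


def save_state(path, unvisited, visited, findings):
    state = {"unvisited": list(unvisited), "visited": sorted(visited),
             "findings": findings}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)


def load_state(path):
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return deque(state["unvisited"]), set(state["visited"]), state["findings"]


def crawl(seeds, hosts, fetcher=fetch_page, state_path=None):
    """Crawl from the seed URLs; resume from state_path if it already exists."""
    unvisited, visited, findings = deque(seeds), set(), {}
    if state_path and os.path.exists(state_path):
        unvisited, visited, findings = load_state(state_path)

    while unvisited:
        url = unvisited.popleft()          # take URL from the unvisited list
        visited.add(url)                   # ...and mark it visited
        try:
            content_type, body = fetcher(url)
        except OSError as exc:             # urllib's URLError subclasses OSError
            findings[url] = {"error": str(exc)}
            continue
        # "Record whatever you want": here just the type and size.
        findings[url] = {"type": content_type, "size": len(body)}

        if content_type == "text/html":
            parser = LinkParser()
            parser.feed(body)
            for href in parser.links:
                # Resolve relative links and drop any #fragment.
                link, _fragment = urldefrag(urljoin(url, href))
                if (allowed(link, hosts)
                        and link not in visited
                        and link not in unvisited):
                    unvisited.append(link)

        if state_path:
            save_state(state_path, unvisited, visited, findings)
    return findings
```

You would call it with something like `crawl(["http://example.org/"], {"example.org"}, state_path="crawl_state.json")`, where the URL, host and file name are placeholders. The `link not in unvisited` check scans the whole queue, which is fine for a handful of sites; for anything larger, keeping a separate set of queued URLs would be the obvious change.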