I have had thoughts of trying to write a simple crawler that would crawl our NPO's websites and content and produce a list of its findings.
Does anybody have any advice on how to go about this?
Crawlers are simple in concept.
You fetch a root page via an HTTP GET, parse it to find URLs, and put them on a queue unless they've been parsed already (so you need a global record of the pages you have already parsed).
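Here's a rough single-threaded sketch of that step in Python, using only the standard library. The start URL and the crawl cap are placeholders for illustration:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

seen = set()                              # global record of pages already parsed
queue = deque(["https://example.org/"])   # root page (placeholder URL)

while queue and len(seen) < 100:          # arbitrary cap so the sketch terminates
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    with urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    for link in parser.links:
        if link not in seen:
            queue.append(link)
    print(url, "->", len(parser.links), "links")
```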
You can use the Content-Type header to find out what kind of content you've fetched, and limit your crawler to parsing only the HTML types.
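With urllib, the response headers expose the media type directly, so the check is a couple of lines (sketch, using the same placeholder URL as above):

```python
from urllib.request import urlopen

with urlopen("https://example.org/") as resp:
    # resp.headers is an email.message-style object; get_content_type()
    # returns just the media type, e.g. "text/html", without the charset part.
    content_type = resp.headers.get_content_type()
    if content_type in ("text/html", "application/xhtml+xml"):
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    else:
        html = None   # skip PDFs, images, and anything else we can't parse
```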
You can strip out the HTML tags to get the plain text, which you can run text analysis on (to extract keywords and so on, the meat of the page). You could even do that on the alt/title attributes of images if you wanted to get that advanced.
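Stripping the tags can reuse the same HTMLParser approach. Here's a sketch that accumulates the visible text plus alt/title attributes from images; the analysis step itself is left out:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulate visible text, skipping script/style, plus img alt/title."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skipping += 1
        elif tag == "img":
            for name, value in attrs:
                if name in ("alt", "title") and value:
                    self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skipping:
            self._skipping -= 1

    def handle_data(self, data):
        if not self._skipping and data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed("<p>Hello <img src='x.png' alt='a logo'> world</p>")
print(" ".join(extractor.chunks))   # -> "Hello a logo world"
```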
And in the background you can have a pool of threads pulling URLs from the queue and doing the same thing. You want to limit the number of threads, of course.
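A minimal sketch of that threaded version, assuming a fetch_and_parse() function (a placeholder here) that fetches a page and returns the links it found. queue.Queue is thread-safe, and the worker count is capped:

```python
import queue
import threading

NUM_WORKERS = 4                 # limit the number of threads

url_queue = queue.Queue()
seen = set()
seen_lock = threading.Lock()    # the visited set is shared, so guard it

def fetch_and_parse(url):
    """Placeholder: fetch the page and return the URLs found on it."""
    return []                   # swap in the real fetch/parse logic here

def worker():
    while True:
        url = url_queue.get()
        try:
            for link in fetch_and_parse(url):
                with seen_lock:
                    if link in seen:
                        continue
                    seen.add(link)
                url_queue.put(link)
        finally:
            url_queue.task_done()

url_queue.put("https://example.org/")
seen.add("https://example.org/")

threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

url_queue.join()                # returns once every queued URL has been processed
print("crawled", len(seen), "pages")
```

The daemon threads just die when the main thread finishes, and Queue.join() gives you a clean "crawl is done" signal without having to poll anything.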