Just one thing to note: a few people have mentioned pulling down the website as XML and then using XPath to iterate through the nodes. Before you do that, check that the site was actually developed in XHTML, so that the markup is a well-formed XML document. Ordinary HTML often isn't (unclosed tags, unquoted attributes), and an XML parser will reject it.
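To illustrate the point, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. The two markup strings and the helper names `is_well_formed` and `paragraph_texts` are made up for this example, and ElementTree only supports a limited subset of XPath, so treat this as a rough demonstration rather than a recommended scraping setup.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so XPath queries must use it.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample page written as proper XHTML: every tag is closed.
well_formed = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Demo</title></head>"
    "<body><p>First</p><p>Second<br /></p></body>"
    "</html>"
)

# Hypothetical sample of ordinary HTML: <p> and <br> are left unclosed.
tag_soup = "<html><body><p>First<p>Second<br></body></html>"


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def paragraph_texts(markup):
    """Parse the markup as XML and return the leading text of each <p>."""
    root = ET.fromstring(markup)
    return [p.text for p in root.findall(".//x:p", XHTML_NS)]


if __name__ == "__main__":
    print(is_well_formed(well_formed))
    print(is_well_formed(tag_soup))
    print(paragraph_texts(well_formed))
```

If the check fails for the site you care about, an XML parser is the wrong tool for that page, and a forgiving HTML parser is the usual fallback.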