I'm trying to crawl about a thousand web sites, and I'm interested only in their HTML content.
Then I transform the HTML into XML to be parsed with XPath to ex
Wow. State-of-the-art crawlers, like the ones search engines use, crawl and index about 1 million URLs a day on a single box. Sure, the HTML-to-XML rendering step takes a bit, but I agree with you on the performance. I've only used private crawlers, so I can't recommend one you'll be able to use, but I hope these performance numbers help in your evaluation.
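
To put that figure in perspective, here is a rough back-of-the-envelope sketch that converts a daily crawl rate into URLs per second and estimates how long a crawl would take. The function names and the 500-pages-per-site figure are my own assumptions for illustration only; plug in your real page counts.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def urls_per_second(urls_per_day: float) -> float:
    """Convert a daily crawl rate into an average per-second rate."""
    return urls_per_day / SECONDS_PER_DAY


def days_to_crawl(total_urls: int, urls_per_day: float) -> float:
    """Estimate how many days a crawl takes at a sustained daily rate."""
    return total_urls / urls_per_day


if __name__ == "__main__":
    daily_rate = 1_000_000  # the single-box figure quoted above

    print(f"Average rate: {urls_per_second(daily_rate):.1f} URLs/s")

    # Hypothetical workload: 1,000 sites at an assumed 500 pages each.
    sites = 1_000
    pages_per_site = 500  # assumption, not a measured number
    total = sites * pages_per_site
    print(f"Estimated crawl time: {days_to_crawl(total, daily_rate):.2f} days")
```

If your crawler's measured rate is far below roughly a dozen URLs per second on one machine, that gap is a reasonable starting point for working out where the time goes (fetching versus the HTML-to-XML conversion).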