There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005), "Extracting Content from Accessible Web Pages", and some signs of interest here, e.g.,
One word: boilerpipe.
For the news domain, on a representative corpus, we're now at 98% / 99% extraction accuracy (average / median).
It is also quite language-independent (today I learned that it works for Nepali, too).
Disclaimer: I am the author of this work.
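
To give a feel for the kind of heuristic that content extractors in this space tend to build on, here is a minimal Python sketch. It is not boilerpipe's implementation or API. It is a toy that splits a page into text blocks and keeps those with enough words and a low share of linked words. The names `BlockCollector` and `extract_main_text`, the tag sets, and the thresholds (`min_words=10`, `max_link_density=0.3`) are all assumptions made for illustration, using only the standard library.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as block boundaries in this toy example.
BLOCK_TAGS = {"p", "div", "td", "li", "tr", "table", "ul", "ol", "br", "body",
              "article", "section", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}  # their contents are never visible text


class BlockCollector(HTMLParser):
    """Collects (text, linked_word_count) pairs, one per text block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words = []
        self._linked = 0   # words of the current block that sit inside <a>
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, if it contains any words.
        if self._words:
            self.blocks.append((" ".join(self._words), self._linked))
        self._words = []
        self._linked = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._linked += len(words)


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text after the last block tag
    kept = []
    for text, linked in parser.blocks:
        n_words = len(text.split())
        if n_words >= min_words and linked / n_words <= max_link_density:
            kept.append(text)
    return "\n".join(kept)
```

Calling `extract_main_text(page_html)` on a page returns the surviving blocks joined by newlines. Short navigation bars and footers tend to be dropped because they are either very short or consist mostly of link text. A real extractor needs far more care than this (malformed markup, comment sections, captions, per-site quirks), which is the reason to reach for a dedicated library.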