What is the state of the art in HTML content extraction?
There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005), Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice? Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for.

Postscript the first: To be precise, the kind of survey I'm after would be a paper (published, unpublished, whatever)