Extracting pure content / text from HTML Pages by excluding navigation and chrome content

I am crawling news websites and want to extract the News Title, the News Abstract (first paragraph), etc.

I plugged into the WebKit parser code so that I can easily navigate a web page as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run a diff algorithm that compares the text of various articles from the same website, which eliminates the text they have in common. This gives me the content minus the common navigation content, etc.
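The diff step described above can be sketched roughly as follows. This is only an illustration using Python's standard `difflib`, not the WebKit-based code from the question; the function name `unique_lines` and the line-by-line comparison are assumptions made for the example.

```python
import difflib


def unique_lines(article, other):
    """Return the lines of `article` that a diff against `other` marks as different.

    Both arguments are plain-text renderings of two pages from the same site.
    Lines the two pages share (menus, footers, ...) are treated as chrome.
    """
    a = [ln.strip() for ln in article.splitlines() if ln.strip()]
    b = [ln.strip() for ln in other.splitlines() if ln.strip()]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    kept = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" cover the parts of `a` with no match in `b`.
        if tag in ("replace", "delete"):
            kept.extend(a[i1:i2])
    return kept


page_a = "Home\nWorld\nSports\nQuake hits city\nA strong quake struck today.\nCopyright 2009"
page_b = "Home\nWorld\nSports\nElection results in\nVoters went to the polls.\nCopyright 2009"
print(unique_lines(page_a, page_b))
```

A sketch like this also shows where the junk can come from: any chrome that differs between the two pages (dates, "related stories" links, ads) survives the diff, so comparing against several pages rather than one may help.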

Despite the above approach, I am still getting quite a lot of junk in my final text, which results in an incorrect News Abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%. Error as in

Can you

1. Suggest an alternative strategy for extracting pure content?
2. Would learning Natural Language Processing help in extracting the correct abstract from these articles?
3. How would you approach the above problem?
4. Are there any research papers on this?

Regards

Ankur Gupta

For question (1), I am not sure. I haven't done this before. Maybe one of the other answers will help.

For question (2), automatic creation of abstracts is not a developed field. It is usually referred to as 'sentence selection', because the typical approach right now is to just select entire sentences.

For question (3), the basic way to create abstracts with machine learning would be to:

1. Create a corpus of existing abstracts.
2. Annotate the abstracts in a useful way. For example, you'd probably want to indicate whether each sentence in the original was chosen and why (or why not).
3. Train a classifier of some sort on the corpus, then use it to classify the sentences in new articles.
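As a rough illustration of the three steps above, here is a minimal sketch of a sentence classifier. It is only one of many possible choices: a tiny Naive Bayes model with add-one smoothing, written with the standard library only. The toy corpus, the "keep"/"skip" labels, and the features (words, position, length) are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def features(sentence, position):
    """Turn a sentence into a bag of (hypothetical) features."""
    words = [w.strip(".,!?\"'") for w in sentence.lower().split()]
    feats = ["w=" + w for w in words if w]
    feats.append("pos=" + ("lead" if position == 0 else "body"))
    feats.append("len=" + ("short" if len(words) < 6 else "long"))
    return feats


class NaiveBayes:
    def __init__(self):
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        """`examples` is an iterable of (feature_list, label) pairs."""
        for feats, label in examples:
            self.label_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)

    def classify(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            # Add-one smoothing so unseen features do not zero out a label.
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for f in feats:
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Steps 1 and 2: a toy annotated corpus of (sentence, position in article, label).
corpus = [
    ("A strong earthquake struck the coastal city on Monday.", 0, "keep"),
    ("The government announced new election results on Tuesday.", 0, "keep"),
    ("Subscribe to our newsletter.", 3, "skip"),
    ("Share this article.", 4, "skip"),
]

# Step 3: train, then classify sentences from a new article.
model = NaiveBayes()
model.train((features(s, pos), label) for s, pos, label in corpus)
print(model.classify(features("The city council announced a new budget on Friday.", 0)))
print(model.classify(features("Subscribe to this newsletter.", 5)))
```

With a real corpus you would need far more annotated sentences, and the choice of features would matter more than the choice of classifier.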

My favourite reference on machine learning is Tom Mitchell's Machine Learning. It lists a number of ways to implement step (3).

For question (4), I am sure there are a few papers because my advisor mentioned it last year, but I do not know where to start since I'm not an expert in the field.

You might have a look at my boilerpipe project on Google Code and test it on pages of your choice using the live web app on Google AppEngine (linked from there).

I am researching this area and have written some papers about content extraction/boilerplate removal from HTML pages. See for example "Boilerplate Detection using Shallow Text Features" and watch the corresponding video on VideoLectures.net. The paper should give you a good overview of the state of the art in this area.
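To give a feel for what "shallow text features" can look like, here is a minimal sketch that splits a page into blocks and keeps those with enough words and a low share of link text. It is not boilerpipe's actual algorithm or API; the class name, the block tags, and the thresholds are assumptions chosen only for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "h1", "h2", "h3"}


class BlockExtractor(HTMLParser):
    """Collects (text, word_count, link_density) for each text block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0  # depth inside <script>/<style>

    def flush(self):
        if self._words:
            n = len(self._words)
            self.blocks.append((" ".join(self._words), n, self._link_words / n))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def extract_content(html, min_words=8, max_link_density=0.3):
    """Keep blocks that are long enough and mostly not link text."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [text for text, n, density in parser.blocks
            if n >= min_words and density <= max_link_density]


sample = """<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/world">World</a></li></ul>
<h1>Quake hits city</h1>
<p>A strong earthquake struck the coastal city early on Monday, officials said.</p>
<div><a href="/terms">Terms of use and privacy policy for all our readers worldwide</a></div>
</body></html>"""
print(extract_content(sample))
```

Note that the short headline is dropped by the word-count threshold here, so a title would have to be picked up separately (for example from the `<title>` or `<h1>` element).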

Cheers,

Christian

I don't know how it works, but check out Readability. It does exactly what you wanted.

Source: https://stackoverflow.com/questions/1696914/extracting-pure-content-text-from-html-pages-by-excluding-navigation-and-chrom

Tags

html