Extracting pure content / text from HTML Pages by excluding navigation and chrome content

Submitted by 眉间皱痕 on 2019-12-01 07:02:16

For question (1), I am not sure. I haven't done this before. Maybe one of the other answers will help.

For question (2), automatic creation of abstracts is not yet a well-developed field. It is usually referred to as 'sentence selection', because the typical approach right now is simply to select entire sentences from the original text.

For question (3), the basic way to create abstracts using machine learning would be to:

  1. Create a corpus of existing abstracts
  2. Annotate the abstracts in a useful way. For example, you'd probably want to indicate whether each sentence in the original was chosen and why (or why not).
  3. Train a classifier of some sort on the corpus, then use it to classify the sentences in new articles.

My favourite reference on machine learning is Tom Mitchell's Machine Learning. It lists a number of ways to implement step (3).
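To make steps (1) to (3) a bit more concrete, here is a minimal sketch of a sentence classifier. It assumes a naive Bayes model over word counts, which is only one of many possible choices for step (3). The `SentenceSelector` class and the tiny `train` corpus are invented for illustration; a real corpus would be far larger and would carry richer annotations.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


class SentenceSelector:
    """Multinomial naive Bayes over word counts, with add-one smoothing.

    Label True means "this sentence was chosen for the abstract".
    """

    def fit(self, sentences, labels):
        self.word_counts = {True: Counter(), False: Counter()}
        self.class_counts = Counter(labels)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_score(self, tokens, label):
        total = sum(self.word_counts[label].values())
        prior = self.class_counts[label] / sum(self.class_counts.values())
        score = math.log(prior)
        for tok in tokens:
            count = self.word_counts[label][tok]
            score += math.log((count + 1) / (total + len(self.vocab)))
        return score

    def predict(self, sentence):
        # Words never seen in training carry no information, so drop them.
        tokens = [t for t in tokenize(sentence) if t in self.vocab]
        return self._log_score(tokens, True) > self._log_score(tokens, False)


# Hypothetical annotated corpus: (sentence, was it chosen for the abstract?)
train = [
    ("We propose a new method for extracting article text.", True),
    ("Our results show that the method improves accuracy.", True),
    ("We conclude that shallow features are sufficient.", True),
    ("The weather was nice during the conference.", False),
    ("Thanks to everyone who brought coffee.", False),
    ("See the appendix for the full list of attendees.", False),
]

model = SentenceSelector().fit([s for s, _ in train], [y for _, y in train])

new_article = [
    "We show that our method improves results.",
    "Thanks to everyone for the coffee.",
]
abstract = [s for s in new_article if model.predict(s)]
print(abstract)
```

In practice you would probably add features beyond the words themselves, such as the position of the sentence in the article or its length, since the annotations from step (2) are meant to capture why a sentence was chosen.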

For question (4), I am sure there are a few papers because my advisor mentioned it last year, but I do not know where to start since I'm not an expert in the field.

You might have a look at my boilerpipe project on Google Code and test it on pages of your choice using the live web app on Google AppEngine (linked from there).

I am researching this area and have written some papers about content extraction/boilerplate removal from HTML pages. See for example "Boilerplate Detection using Shallow Text Features" and watch the corresponding video on VideoLectures.net. The paper should give you a good overview of the state of the art in this area.
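As a rough illustration of what "shallow text features" can mean, the sketch below splits a page into text blocks and keeps the blocks that have enough words and a low share of link text. This is not boilerpipe's actual algorithm: the `BlockCollector` class, the `extract_content` function, the list of block-level tags, and both thresholds are assumptions chosen for the example, using only Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Assumed set of tags that start or end a text block.
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "br",
              "h1", "h2", "h3", "body", "nav", "header", "footer",
              "section", "article"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects (text, word_count, link_density) for each text block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words = []
        self._link_words = 0
        self._anchor_depth = 0
        self._skip_depth = 0

    def _flush(self):
        # Close the current block and record its shallow features.
        if self._words:
            n = len(self._words)
            self.blocks.append((" ".join(self._words), n, self._link_words / n))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._anchor_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._anchor_depth = max(0, self._anchor_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        words = data.split()
        self._words.extend(words)
        if self._anchor_depth:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_content(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [text for text, n, density in parser.blocks
            if n >= min_words and density <= max_link_density]


SAMPLE = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About us</a> <a href="/contact">Contact</a></div>
<p>Boilerplate removal tries to keep the main article text and drop menus, footers and other page chrome.</p>
<div class="footer">Copyright 2019 <a href="/terms">Terms of use</a></div>
</body></html>"""

content = extract_content(SAMPLE)
print(content)
```

On the sample page, the navigation block consists entirely of link text and the footer is too short, so only the paragraph survives. Real pages are messier, and fixed thresholds like these would need tuning or replacing with a trained classifier.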

Cheers,

Christian

I don't know how it works, but check out Readability. It does exactly what you wanted.
