What is the state of the art in HTML content extraction?

無奈伤痛 2021-01-29 23:52

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g.,

8 answers
  •  广开言路
    2021-01-30 00:28

    One word: boilerpipe.

    For the news domain, on a representative corpus, we're now at 98% average / 99% median extraction accuracy.

    • Demo: http://boilerpipe-web.appspot.com/
    • Code: http://code.google.com/p/boilerpipe/
    • Presentation: http://videolectures.net/wsdm2010_kohlschutter_bdu/
    • Dataset and slides: http://www.l3s.de/~kohlschuetter/boilerplate/
    • PhD thesis: http://www.kohlschutter.com/pdf/Dissertation-Kohlschuetter.pdf

    It is also quite language-independent (today I learned it works for Nepali, too).

    Disclaimer: I am the author of this work.
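
For readers who want a feel for the general idea before following the links above, here is a minimal Python sketch of block-level boilerplate filtering using two shallow text features: word count and link density. It is not boilerpipe's implementation or API. The names `BlockCollector` and `extract_main_text`, the list of block tags, and the thresholds (`min_words=10`, `max_link_density=0.33`) are all assumptions chosen for illustration, and a real extractor would tune or learn them on a corpus.

```python
from html.parser import HTMLParser

# Assumption: these tags start/end a text block; real pages need a richer list.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3",
              "article", "section", "nav", "header", "footer"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and records each block's link density."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_density) tuples
        self._words = []        # words of the block being built
        self._link_words = 0    # how many of those words sit inside <a>
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it has any text, and start a new one.
        if self._words:
            density = self._link_words / len(self._words)
            self.blocks.append((" ".join(self._words), density))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style contents
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()  # emit whatever text is left at end of input


def extract_main_text(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [text for text, density in parser.blocks
            if len(text.split()) >= min_words and density <= max_link_density]
```

Calling `extract_main_text(page_html)` returns the surviving blocks as a list of strings. Fixed thresholds like these are brittle on short paragraphs and link-heavy articles, which is one reason to prefer a tested library for real use.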
