Recommended HTML readability transcoding libraries in .Net [closed]

Benjamin Gruenbaum

I don't know if this is still relevant, but this is an interesting question I run into a lot, and I haven't seen much material on the web that covers it.

I've implemented a tool that does this myself over the span of several months. Due to contractual obligations I cannot share this tool freely, but I'm free to share some advice about what you can do.

The Sad Truth :(

I can assure you that we tried every option before undertaking the task of creating a readability tool ourselves. At the moment, no existing tool was satisfactory for what we needed.

So, you want to extract content?

Great! You will need a few things:

  1. A tool for handling the page's HTML. I use CsQuery, which is what Jamie suggested in the other answer here. It works great for selecting elements.
  2. A programming language (that's C# in this example; any .NET language will do!)
  3. A tool that lets you download the pages themselves. CsQuery handles this on its own with CreateFromUrl. You can create your own helper class for downloading the pages if you want to pre-process them and get finer-grained control over the headers (try playing with the user agent, looking for mobile versions, etc.). A minimal sketch of such a helper follows this list.
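
Here is a minimal sketch of such a helper, assuming a plain WebClient is enough for your needs. The PageFetcher class name and the user-agent string are illustrative, not part of CsQuery:

using System.Net;
using CsQuery;

static class PageFetcher
{
    // Download the raw HTML ourselves so we control the headers,
    // then hand it to CsQuery for parsing.
    public static CQ Fetch(string url)
    {
        using (var client = new WebClient())
        {
            // A mobile user agent often gets you a much simpler page.
            client.Headers[HttpRequestHeader.UserAgent] =
                "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X)";
            return CQ.Create(client.DownloadString(url));
        }
    }
}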

Ok, I'm all set up, what's next?

There is surprisingly little research in the field of content extraction. One piece that stands out is Boilerplate Detection using Shallow Text Features. You can also read this answer here on StackOverflow from the paper's author to see how Readability works and what some of the approaches are.

Here are some more papers I enjoyed:

I'm done reading, what's done in practice?

In my experience, the following are good strategies for extracting content:

  • Simple heuristics: filtering <header> and <nav> tags, removing lists that contain only links, removing the entire <head> section, and giving elements a negative/positive score based on their names, then removing the ones with the lowest score (for example, a div with a class name containing 'navigation' might get a lower score). This is how Readability works; the first sketch after this list shows the idea.

  • Meta-content: analyzing the density of links relative to text. This is a powerful tool on its own: you can compare the amount of link text to the amount of overall text, and the densest non-link text is usually where the content is. CsQuery lets you compare the amount of text to the amount of text in nested link tags easily; the second sketch after this list shows one way to do it.

  • Templating: crawl several pages on the same website and analyze the differences between them; the constant parts are usually the page layout, navigation and ads, so you can filter based on similarities. This 'template'-based approach is very effective; the trick is to come up with an efficient algorithm to keep track of templates and to detect the template itself (third sketch after this list).

  • Natural language processing: probably the most advanced approach here. With NLP tools it is relatively simple to detect paragraphs and text structure, and thus where the actual content starts and ends.

  • Learning: learning is a very powerful concept for this sort of task. In its most basic form this involves creating a program that 'guesses' which HTML elements to remove, checks itself against a set of pre-defined results for a website, and learns which patterns are OK to remove. In my experience this approach works best when trained per site.

  • A fixed list of selectors: surprisingly, this is extremely potent, and people tend to forget about it. If you are scraping a specific few sites, writing selectors and extracting the content manually is probably the fastest thing to do. Keep it simple if you can :)
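
Here is the first sketch: a minimal, Readability-style scoring pass with CsQuery. The positive/negative hint lists and the candidate selector are illustrative assumptions; the real Readability uses many more signals:

using System.Linq;
using CsQuery;

static class NoiseFilter
{
    // Illustrative name hints; Readability's real lists are much longer.
    static readonly string[] Negative = { "nav", "sidebar", "footer", "comment", "ad" };
    static readonly string[] Positive = { "article", "content", "post", "story", "text" };

    public static CQ StripNoise(CQ dom)
    {
        // Obvious non-content goes first.
        dom["head, header, nav, aside, script, style"].Remove();

        // Score the remaining divs by class/id name and drop the losers.
        foreach (var el in dom["div"].ToList())
        {
            string hint = ((el.GetAttribute("class") ?? "") + " "
                         + (el.Id ?? "")).ToLowerInvariant();
            int score = Positive.Count(p => hint.Contains(p))
                      - Negative.Count(n => hint.Contains(n));
            if (score < 0)
                el.Remove();
        }
        return dom;
    }
}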
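
The second sketch ranks candidate blocks by link density: lots of text, discounted by how much of it sits inside links. The candidate selector and the scoring formula are illustrative assumptions:

using System.Linq;
using CsQuery;

static class DensestBlock
{
    public static IDomObject Find(CQ dom)
    {
        return dom["div, section, article"]
            .ToList()
            .OrderByDescending(el =>
            {
                var block = el.Cq();  // wrap the element as a CQ object
                double total = block.Text().Length;
                double links = block.Find("a").Text().Length;
                // Lots of text, few links => probably the article body.
                return total <= 0 ? 0.0 : total * (1.0 - links / total);
            })
            .FirstOrDefault();
    }
}

On most article pages this picks the main text container; tune the candidate selector per site.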
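
The third sketch is a deliberately naive version of templating, assuming that markup which renders byte-for-byte identically on two pages from the same site is layout rather than content. A real implementation needs a fuzzier similarity measure than exact matches:

using System.Collections.Generic;
using System.Linq;
using CsQuery;

static class TemplateFilter
{
    public static CQ StripTemplate(CQ page, CQ otherPageSameSite)
    {
        // Render every candidate block on the reference page once.
        var seen = new HashSet<string>(
            otherPageSameSite["div, ul, header, footer, nav"]
                .Select(el => el.Render()));

        // Anything that appears verbatim on both pages is probably template.
        foreach (var el in page["div, ul, header, footer, nav"].ToList())
        {
            if (seen.Contains(el.Render()))
                el.Remove();
        }
        return page;
    }
}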

In Practice

Mix and match: a good solution usually combines more than one of these strategies. We ended up with something quite complex because we use it for a complex task. In practice, content extraction is a really complicated problem. Don't try to create something very general; stick to the content you need to scrape. Test a lot: unit tests and regression tests are very important for this sort of program. And always compare against Readability and read its code; it's pretty simple, and it will probably get you started.

Best of luck, let me know how this goes.

CsQuery: https://github.com/jamietre/csquery

It's a .NET 4 jQuery port. Getting rid of non-content nodes can be done in a number of ways: use the .Text method to just grab everything as a string, or filter for text nodes, e.g.

using System.Collections.Generic;
using System.Linq;
using CsQuery;

var dom = CQ.CreateFromUrl(someUrl);
// or: var dom = CQ.Create(htmlText);

IEnumerable<string> allTextStrings = dom.Select("*")
    .Contents()
    .Where(el => el.NodeType == NodeType.TEXT_NODE)
    .Select(el => el.NodeValue);

It works the same as jQuery, except, of course, you also have the .NET framework and LINQ to make your life easier. The Select selects all nodes in the DOM, then Contents selects all children of each (including text nodes). That's it for CsQuery; then with LINQ the Where filters for only text nodes, and the Select gets the actual text out of each node.

This will include a lot of whitespace, since it returns everything. If you simply want a blob of text for the whole page, just

string text = dom.Select("body").Text();

will do it. The Text method coalesces whitespace, so there will be a single space between each piece of actual text.
