Shorten a text and only keep important sentences

The German website nandoo.net lets you shorten a news article. If you change the percentage value with a slider, the text changes and some sentences are left out.

You can see that in action here:

http://www.nandoo.net/read/article/299925/

The news article is on the left side, with its tags marked. The slider is at the top of the second column. The further you move the slider to the left, the shorter the text becomes.

How can you offer something like that? Are there any algorithms you can use to achieve it?

My idea was that their algorithm counts the number of tags and nouns in each sentence. Then the sentences with the fewest tags/nouns are left out.
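
To make that idea concrete, here is a minimal sketch in Python. It assumes the tags are already known as a set of lowercase words, splits sentences naively on end punctuation, and leaves nouns out because detecting them would need a part-of-speech tagger. The names `split_sentences`, `shorten_by_tags`, `tags` and `keep_ratio` are made up for illustration, and nothing here is known to be what nandoo.net actually does.

```python
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def shorten_by_tags(text, tags, keep_ratio):
    """Keep the keep_ratio share of sentences that mention the most tags."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    keep = max(1, round(len(sentences) * keep_ratio))

    def tag_count(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(1 for w in words if w in tags)

    # Rank sentence indices by tag count; the sort is stable, so ties
    # favour sentences that appear earlier in the article.
    ranked = sorted(range(len(sentences)), key=lambda i: -tag_count(sentences[i]))
    kept = sorted(ranked[:keep])  # restore the original sentence order
    return " ".join(sentences[i] for i in kept)
```

Calling `shorten_by_tags(article, {"election", "president"}, 0.5)` would keep roughly half of the sentences, preferring those that mention the tags. The slider value maps directly onto `keep_ratio`.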

Could that be true? Or do you have another idea?

I hope you can help me. Thanks in advance!

Usually you want to keep the sentences that contain words specific to that article.

That is, the more "generic" the sentence is, the less it describes this particular article.

A common way to do this is Bayesian analysis, much like a spam filter. First determine which words appear more often in the article than you'd expect from ordinary text, then find the sentences that feature those words.
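
As a rough illustration of that idea, the following Python sketch scores each word by how much more often it occurs in the article than in a background corpus, then keeps the highest-scoring sentences. This is a simplified frequency-ratio heuristic rather than a full Bayesian model. The background counts, the add-one smoothing, and the names `word_scores` and `summarize` are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def word_scores(article, background_counts, background_total):
    """Ratio of each word's relative frequency in the article to its
    add-one smoothed relative frequency in the background corpus."""
    counts = Counter(tokenize(article))
    total = sum(counts.values())
    vocab = len(background_counts) + 1  # +1 slot for unseen words
    scores = {}
    for word, n in counts.items():
        expected = (background_counts.get(word, 0) + 1) / (background_total + vocab)
        scores[word] = (n / total) / expected
    return scores

def summarize(article, background_counts, background_total, percent):
    """Keep about `percent` percent of the sentences, in original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", article.strip()) if s]
    if not sentences:
        return ""
    scores = word_scores(article, background_counts, background_total)

    def sentence_score(sentence):
        words = tokenize(sentence)
        # Average the word scores so long sentences are not
        # favoured just for their length.
        return sum(scores[w] for w in words) / len(words) if words else 0.0

    keep = max(1, round(len(sentences) * percent / 100))
    ranked = sorted(range(len(sentences)), key=lambda i: -sentence_score(sentences[i]))
    return " ".join(sentences[i] for i in sorted(ranked[:keep]))
```

In practice `background_counts` would come from a large collection of unrelated articles in the same language, so that common words such as "the" score low and article-specific words score high.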

This is a hot research topic in Computational Linguistics. The shallow approach, using Bayesian Filtering, is not likely to yield perfect results - but you probably don't need perfect results anyway.

In CL, the 80-20 rule quickly becomes the 95-5 rule, so if you are content with what you can achieve through shallow methods, skip this answer.

If you want to see whether you can improve on your results, you could try to find some better resources. The task you're referring to is called 'text summarization' in the research community, and it has its own web page, which is hopelessly outdated. Mani and Maybury (1999) is probably a good overview (I haven't read it myself), but also quite antiquated. Martin Hassel's dissertation on the topic is more recent and also quite exhaustive, including language-independent (read: statistical, i.e. shallow) methods.

As always, Google will be able to help you, too. Just search for text summarization.

Source: https://stackoverflow.com/questions/742711/shorten-a-text-and-only-keep-important-sentences

Tags

algorithm

nlp

semantics