Given a document, select a relevant snippet

Submitted by 前提是你 on 2019-12-02 17:45:22

Automatic Text Summarization

It sounds like you're interested in automatic text summarization. For a nice overview of the problem, the issues involved, and the available algorithms, take a look at Das and Martins's paper A Survey on Automatic Text Summarization (2007).

Simple Algorithm

A simple but reasonably effective summarization algorithm is to select a limited number of sentences from the original text that contain the most frequent content words (i.e., the most frequent words that are not on a stop word list).

Summarizer(originalText, maxSummarySize):
   // start with the raw freqs, e.g. [(10,'the'), (3,'language'), (8,'code')...]
   wordFrequencies = getWordCounts(originalText)
   // filter out stop words, e.g. [(3, 'language'), (8, 'code')...]
   contentWordFrequencies = filterStopWords(wordFrequencies)
   // sort by descending freq & drop counts, e.g. ['code', 'language'...]
   contentWordsSortedByFreq = sortByFreqThenDropFreq(contentWordFrequencies)

   // split the text into sentences
   sentences = getSentences(originalText)

   // select up to maxSummarySize sentences
   setSummarySentences = {}
   foreach word in contentWordsSortedByFreq:
      if setSummarySentences.size() >= maxSummarySize:
         break
      firstMatchingSentence = search(sentences, word)
      setSummarySentences.add(firstMatchingSentence)

   // construct the summary from the selected sentences, preserving original ordering
   summary = ""
   foreach sentence in sentences:
      if sentence in setSummarySentences:
         summary = summary + " " + sentence

   return summary
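
For anyone who wants something runnable, here is a minimal Python sketch of the pseudocode above. It is only an illustration under a few assumptions of mine: the tiny STOP_WORDS set, the regex-based tokenizer and sentence splitter, and the function names (tokenize, split_sentences, summarize) are all hypothetical stand-ins, not code from Classifier4J or any other package. A real system would likely use a proper stop list and sentence tokenizer.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real stop lists are much longer).
STOP_WORDS = frozenset({
    "a", "an", "the", "is", "are", "for", "with", "of", "i", "don't",
    "think", "there", "any", "other", "really", "and", "to", "in",
})


def tokenize(text):
    """Lowercase the text and return its word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def summarize(original_text, max_summary_size):
    """Return up to max_summary_size sentences, kept in their original order."""
    # Count content words only, i.e. tokens that are not stop words.
    counts = Counter(w for w in tokenize(original_text) if w not in STOP_WORDS)
    # most_common() sorts by descending count; ties keep first-seen order.
    content_words = [word for word, _ in counts.most_common()]

    sentences = split_sentences(original_text)
    sentence_tokens = [set(tokenize(s)) for s in sentences]

    # For each frequent word, pick the first sentence that contains it.
    chosen = set()
    for word in content_words:
        if len(chosen) >= max_summary_size:
            break
        for index, tokens in enumerate(sentence_tokens):
            if word in tokens:
                chosen.add(index)
                break

    # Rebuild the summary in original sentence order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Storing sentence indices rather than the sentence strings keeps the original ordering easy to restore and avoids confusion if the same sentence appears twice in the text.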

Some open source packages that do summarization using this algorithm are:

Classifier4J (Java)

If you're using Java, you can use Classifier4J's SimpleSummariser module.

Using the example found here, let's assume the original text is:

Classifier4J is a java package for working with text. Classifier4J includes a summariser. A Summariser allows the summary of text. A Summariser is really cool. I don't think there are any other java summarisers.

As seen in the following snippet, you can easily create a simple one-sentence summary:

// Request a 1 sentence summary
String summary = summariser.summarise(longOriginalText, 1);

Using the algorithm above, this will produce "Classifier4J includes a summariser." That is the first sentence containing "summariser", the most frequent content word in the text.

NClassifier (C#)

If you're using C#, there's a port of Classifier4J to C# called NClassifier.

Tristan Havelick's Summarizer for NLTK (Python)

There's a work-in-progress Python port of Classifier4J's summarizer built with Python's Natural Language Toolkit (NLTK) available here.
