Algorithms or libraries for textual analysis, specifically: dominant words, phrases across text, and collection of text

Submitted by 无人久伴 on 2019-11-29 19:27:59

One option for what you're doing is term frequency-inverse document frequency, or tf-idf. Terms that occur often in a document but rarely across the rest of the collection get the highest weighting under this calculation. Check it out here: http://en.wikipedia.org/wiki/Tf-idf
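To make the weighting concrete, here is a minimal sketch in plain Python. It assumes the documents are already tokenized, uses relative term frequency and an unsmoothed log(N / df) for the inverse document frequency (other variants exist), and the function name `tf_idf` is made up for this illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. weight = tf * log(N / df), where tf is the
    term's relative frequency in the document and df is the number of
    documents that contain the term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
for scores in tf_idf(docs):
    # Show the two highest-weighted terms of each document
    print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:2])
```

A word such as "the", which appears in every document, gets a weight of zero here, which is the behaviour you want for dominant-word detection.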

Another option is to use something like a Naive Bayes classifier with words as features, and find the strongest features in the text to determine the class of the document. A maximum entropy classifier would work similarly.
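As a rough sketch of the Naive Bayes idea, the code below trains a tiny multinomial model with add-one smoothing and ranks words by how strongly they favour one class over another. The training data, the function names, and the use of a log-likelihood ratio as the measure of "strongest" are all assumptions for illustration; a library such as NLTK has its own classifier classes.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def log_likelihood(word, label, word_counts, vocab):
    # Add-one (Laplace) smoothing so an unseen word does not zero out a class
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + 1) / (total + len(vocab)))

def classify(tokens, model):
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / n_docs)  # class prior
        for word in tokens:
            if word in vocab:  # ignore words never seen in training
                score += log_likelihood(word, label, word_counts, vocab)
        scores[label] = score
    return max(scores, key=scores.get)

def strongest_features(model, label, other, k=3):
    """Words whose likelihood under `label` most exceeds that under `other`."""
    _, word_counts, vocab = model
    def ratio(word):
        return (log_likelihood(word, label, word_counts, vocab)
                - log_likelihood(word, other, word_counts, vocab))
    return sorted(vocab, key=ratio, reverse=True)[:k]

training = [
    ("goal match team goal".split(), "sports"),
    ("team win match".split(), "sports"),
    ("election vote party".split(), "politics"),
    ("party vote vote law".split(), "politics"),
]
model = train(training)
print(classify("vote party".split(), model))
print(strongest_features(model, "politics", "sports", k=1))
```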

As far as tools to do this, the best tool to start with would be NLTK, a Python library with extensive documentation and tutorials: http://nltk.sourceforge.net/

For Java, try OpenNLP: http://opennlp.sourceforge.net/

For phrases, consider the second option above with bigrams and trigrams as features, or use bigrams and trigrams as the terms in tf-idf.
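Extracting bigrams and trigrams from a token list takes only a few lines. This is a sketch that assumes whitespace tokenization is good enough for the moment; the helper name `ngrams` is chosen for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = "new york is bigger than new jersey but new york is pricier".split()
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))
print(bigram_counts.most_common(3))
print(trigram_counts.most_common(1))
```

The resulting tuples can be counted directly, fed to a classifier as features, or treated as terms in a tf-idf calculation.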

Good luck!

To add to Robert Elwell's answer:

  • Stemming and collapsing word forms. A simple method in English is to apply Porter stemming to the lower-cased word forms.
  • The term for the "common words" is "stop words"; a collection of them is a "stop list".
  • Reading through the NLTK book, as suggested, will explain a lot of these introductory issues well.
  • Some of the problems you have to tackle are splitting text into sentences (so that your bigrams and n-gram phrases don't cross sentence boundaries), splitting sentences into tokens, and deciding what to do about possessive forms, for example.
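The preprocessing issues in the list above can be sketched with the standard library alone. Everything here is deliberately naive and is an assumption for illustration: the sentence rule, the tiny stop list, and the decision to strip a trailing possessive 's. Stemming is left out; NLTK provides a PorterStemmer class for that step.

```python
import re

# Tiny illustrative stop list; real stop lists are much longer
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "on"}

def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Lower-case, keep runs of letters with internal apostrophes,
    # then drop a trailing possessive 's
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", sentence.lower())
    return [re.sub(r"'s$", "", w) for w in words]

def sentence_bigrams(text):
    """Bigrams of non-stop-word tokens that never span a sentence boundary."""
    pairs = []
    for sentence in split_sentences(text):
        tokens = [t for t in tokenize(sentence) if t not in STOP_WORDS]
        pairs.extend(zip(tokens, tokens[1:]))
    return pairs

text = "The dog's owner is in the park. The park closes early!"
print(sentence_bigrams(text))
```

Note one consequence of removing stop words before pairing: words that were separated by stop words become neighbours. Whether that is desirable is one of the judgment calls with no single correct answer.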

None of this stuff is clear cut, nor does any of it have "correct answers". See also the "nlp" and "natural-language" SO tags.

Good luck! This is a non-trivial project.

Alrighty. So you've got a document containing text and a collection of documents (a corpus). There are a number of ways to do this.

I would suggest using the Lucene engine (Java) to index your documents. Lucene supports a data structure (Index) that maintains a number of documents in it. A document itself is a data structure that can contain "fields" - say, author, title, text, etc. You can choose which fields are indexed and which ones are not.

Adding documents to an index is trivial. Lucene is also built for speed, and can scale superbly.

Next, you want to figure out the terms and their frequencies. Since Lucene has already calculated these for you during the indexing process, you can either use the docFreq function and build your own term frequency function, or use the IndexReader class's getTermFreqVectors function to get the terms (and their frequencies).
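To show what an index keeps track of, here is a toy in-memory version in Python. It is not the Lucene API: the class `TinyIndex` and its methods are invented for this sketch and only mirror the two ideas mentioned above, a per-term document frequency and a per-document term frequency vector.

```python
from collections import Counter, defaultdict

class TinyIndex:
    """Toy in-memory index; NOT the Lucene API, only the same idea."""

    def __init__(self):
        self.term_vectors = {}            # doc_id -> Counter of term frequencies
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add_document(self, doc_id, text):
        counts = Counter(text.lower().split())
        self.term_vectors[doc_id] = counts
        for term in counts:
            self.postings[term].add(doc_id)

    def doc_freq(self, term):
        # Number of documents that contain the term
        return len(self.postings.get(term, ()))

    def term_freq_vector(self, doc_id):
        # Terms of one document with their in-document frequencies
        return self.term_vectors[doc_id]

index = TinyIndex()
index.add_document(1, "Lucene indexes documents and Lucene scales")
index.add_document(2, "documents have fields")
print(index.doc_freq("documents"))
print(index.term_freq_vector(1).most_common(1))
```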

Now it's up to you how to sort the terms and what criteria to use to filter the words you want. To figure out relationships, you can use a Java API to the open source WordNet library. To stem words, use Lucene's PorterStemFilter class. The phrase importance part is trickier, but once you've gotten this far, you can search for tips on how to integrate n-gram searching into Lucene (hint).

Good luck!

yogman

You could use Windows Indexing Service, which comes with the Windows Platform SDK. Or, just read the following introduction to get an overview of NLP.

http://msdn.microsoft.com/en-us/library/ms693179(VS.85).aspx

Consider the MapReduce model for getting the word counts, and then derive the frequencies as described for tf-idf.

Hadoop is an Apache MapReduce framework that can do the heavy lifting of counting words across many documents. http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
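The shape of a MapReduce word count can be sketched in a single Python process. This does not use Hadoop or its API; the function names are made up, and the `shuffle` step stands in for the grouping that a real framework performs between the map and reduce phases.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc_id, text):
    # Emit a (word, 1) pair for every word occurrence in one document
    return [(word, 1) for word in text.lower().split()]

def shuffle(pairs):
    # Group values by key, as the framework would between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    # Sum the partial counts for one word
    return word, sum(counts)

def word_count(docs):
    """docs maps a document id to its text; returns {word: total count}."""
    mapped = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
    return dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())

print(word_count({"doc1": "to be or not to be", "doc2": "to do"}))
```

Emitting (word, doc_id) keys instead of bare words would give per-document counts, which is what a tf-idf calculation needs.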

No single framework will solve everything you want. You have to choose the right combination of concepts and frameworks for your needs.

Darius Bacon

I would also like to see if there is a way to identify important phrases. (Instead of a count of a word, the count of a phrase being 2-3 words together)

This part of your problem is called collocation extraction. (At least if you take 'important phrases' to be phrases that appear significantly more often than by chance.) I gave an answer over at another SO question about that specific subproblem.
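One common way to score "more often than by chance" is pointwise mutual information (PMI). The sketch below is an assumption for illustration, not necessarily the method described in the linked answer; the function name and the `min_count` cut-off are invented here.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:  # rare pairs get unreliable, inflated PMI
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair co-occurs more often than independence predicts
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = "new york is big and new york is old and the city is new".split()
for pair, score in pmi_bigrams(tokens):
    print(pair, round(score, 2))
```

On a corpus this small the scores mean little; the frequency cut-off matters much more on real text, where PMI otherwise favours pairs seen only once.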

It seems that what you are looking for is called bag-of-words document clustering/classification. You will find guidance with this search.
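As a small illustration of the bag-of-words idea, each document can be reduced to a word-count vector and two documents compared by cosine similarity, which is a typical building block for clustering. The function names are made up for this sketch and tokenization is plain whitespace splitting.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Word order is discarded; only the counts are kept
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = bag_of_words("the cat sat on the mat")
d2 = bag_of_words("the cat lay on the rug")
d3 = bag_of_words("stock prices fell sharply")
print(cosine_similarity(d1, d2))
print(cosine_similarity(d1, d3))
```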
