How to extract keywords (tags) from text

Question

I am currently trying to implement a tagging engine in Java and have searched for ways to extract keywords/tags from texts (articles). I found some answers on Stack Overflow suggesting Pointwise Mutual Information (PMI).

Solution 1

Solution 2

I can't use Python and NLTK, so I have to implement it myself. But I don't know how to calculate the probabilities. The equation looks like this:

PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]

What I want to know is how to calculate P(term, doc).

I already have a large text corpus and a collection of articles. The articles are not part of the corpus. The corpus is indexed with Lucene.

Please help me out. Best regards.

Answer 1:

There are a lot of algorithms and tools for doing this:

Open source tools:

KEA (http://www.nzdl.org/Kea/): a supervised approach that uses training data and a controlled vocabulary.

Maui indexer (http://code.google.com/p/maui-indexer/): basically an extension of KEA that can also use an encyclopedia for key phrase extraction.

Carrot2 (http://project.carrot2.org/): an unsupervised approach to key phrase extraction. It supports many input and output formats and many parameters for key phrase extraction.

MALLET topic modeling module (http://mallet.cs.umass.edu/topics.php)

Stanford topic modeling tool (http://nlp.stanford.edu/software/tmt/tmt-0.3/)

Mahout clustering algorithms (http://mahout.apache.org/)

Commercial APIs:

Alchemy API (http://www.alchemyapi.com/api/keyword-extraction/)

Zemanta API (http://www.zemanta.com/developer/)

Yahoo term extraction API (http://developer.yahoo.com/contentanalysis/)
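
As for the PMI equation in the question, here is a rough sketch of one common way to estimate the probabilities from raw token counts. It is not the only way and not taken from any of the tools above. It assumes P(term, doc) means "a token picked at random from the whole collection is this term and sits in this document", so P(term, doc) = count(term in doc) / N, P(term) = count(term in collection) / N and P(doc) = length(doc) / N. Because the article is not part of the corpus, the sketch adds the article's own counts to the corpus counts. The class name PmiSketch, the method names and all numbers are made up for illustration; in practice the corpus counts would come from your index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
class PmiSketch {

    // PMI(term, doc) = log [ P(term, doc) / (P(term) * P(doc)) ],
    // with all three probabilities estimated as count / collectionSize.
    static double pmi(long termInDoc, long docLength,
                      long termInCollection, long collectionSize) {
        if (termInDoc <= 0 || docLength <= 0
                || termInCollection <= 0 || collectionSize <= 0) {
            return Double.NEGATIVE_INFINITY; // undefined without positive counts
        }
        double pTermDoc = (double) termInDoc / collectionSize;
        double pTerm = (double) termInCollection / collectionSize;
        double pDoc = (double) docLength / collectionSize;
        return Math.log(pTermDoc / (pTerm * pDoc));
    }

    // Scores every distinct token of one article against corpus counts.
    // Assumption: the article is treated as if it were added to the corpus,
    // so its own counts are included in the collection totals.
    static Map<String, Double> scoreTerms(List<String> docTokens,
                                          Map<String, Long> corpusCounts,
                                          long corpusSize) {
        Map<String, Long> docCounts = new LinkedHashMap<>();
        for (String token : docTokens) {
            docCounts.merge(token, 1L, Long::sum);
        }
        long docLength = docTokens.size();
        long collectionSize = corpusSize + docLength;

        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : docCounts.entrySet()) {
            long inCollection =
                    corpusCounts.getOrDefault(e.getKey(), 0L) + e.getValue();
            scores.put(e.getKey(),
                    pmi(e.getValue(), docLength, inCollection, collectionSize));
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up corpus statistics for illustration only.
        Map<String, Long> corpusCounts = new HashMap<>();
        corpusCounts.put("the", 1000L);
        corpusCounts.put("index", 10L);
        corpusCounts.put("lucene", 2L);

        List<String> article = Arrays.asList("lucene", "index", "lucene", "the");
        scoreTerms(article, corpusCounts, 10000L)
                .forEach((term, score) -> System.out.println(term + "\t" + score));
    }
}
```

Terms with the highest scores are the tag candidates. Keep in mind that PMI tends to favour very rare terms, so you will probably want a minimum frequency cutoff and stop word filtering before ranking.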

Source: https://stackoverflow.com/questions/14339290/how-to-extract-keywords-tags-from-text
