Question
I am currently trying to implement a tagging engine in Java and have been searching for ways to extract keywords/tags from texts (articles). I found some solutions on Stack Overflow suggesting the use of Pointwise Mutual Information (PMI).
Solution 1
Solution 2
I can't use Python and NLTK, so I have to implement it myself, but I don't know how to calculate the probabilities. The equation looks like this:
PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]
What I want to know is how to calculate P(term, doc).
I already have a large text corpus and a collection of articles. The articles are not part of the corpus. The corpus is indexed with Lucene.
Please help me out. Best regards.
Answer 1:
There are a lot of algorithms and tools for doing this:
Open source tools:
KEA (http://www.nzdl.org/Kea/): a supervised approach that uses training data and a controlled vocabulary.
Maui indexer (http://code.google.com/p/maui-indexer/): essentially an extension of KEA that can also use an encyclopedia for key phrase extraction.
Carrot2 (http://project.carrot2.org/): an unsupervised approach to key phrase extraction. It supports many input and output formats and many parameters for key phrase extraction.
MALLET topic modeling module (http://mallet.cs.umass.edu/topics.php)
Stanford Topic Modeling Toolbox (http://nlp.stanford.edu/software/tmt/tmt-0.3/)
Apache Mahout clustering algorithms (http://mahout.apache.org/)
Commercial APIs:
AlchemyAPI (http://www.alchemyapi.com/api/keyword-extraction/)
Zemanta API (http://www.zemanta.com/developer/)
Yahoo term extraction API (http://developer.yahoo.com/contentanalysis/)
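If you still want to implement the PMI formula from the question yourself, here is a minimal sketch of one common way to estimate the probabilities from raw counts. It is only an illustration under stated assumptions, not a definitive method: the article is treated as if it were added to the corpus, P(term, doc) is estimated as the term's count in the article divided by the total token count, P(term) as the term's total count divided by the total token count, and P(doc) as the article length divided by the total token count. The class name `PmiSketch`, the whitespace tokenizer, and the example counts are all hypothetical. With a Lucene index, the corpus-level counts could probably be read from `IndexReader` (for example `totalTermFreq(Term)` and `getSumTotalTermFreq(String)` in Lucene 4.x and later), but check the API of the version you use.

```java
import java.util.HashMap;
import java.util.Map;

final class PmiSketch {

    /**
     * PMI(term, doc) = log [ P(term, doc) / (P(term) * P(doc)) ],
     * with every probability estimated as count / totalTokens.
     */
    static double pmi(long termInDoc, long termTotal, long docLength, long totalTokens) {
        if (termInDoc == 0 || termTotal == 0 || docLength == 0) {
            // The term never occurs in the document, so the joint probability is 0.
            return Double.NEGATIVE_INFINITY;
        }
        double pJoint = (double) termInDoc / totalTokens;
        double pTerm = (double) termTotal / totalTokens;
        double pDoc = (double) docLength / totalTokens;
        return Math.log(pJoint / (pTerm * pDoc));
    }

    /** Counts lower-cased, whitespace-separated tokens (a deliberately naive tokenizer). */
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical background counts; in practice these would come from the index.
        Map<String, Integer> corpusCounts = new HashMap<>();
        corpusCounts.put("lucene", 2);
        corpusCounts.put("index", 50);
        corpusCounts.put("search", 48);
        long corpusTokens = 1000;

        String article = "Lucene index Lucene search";
        Map<String, Integer> articleCounts = countTerms(article);
        long docLength = 0;
        for (int c : articleCounts.values()) {
            docLength += c;
        }

        // Assumption: pretend the article has been added to the corpus.
        long totalTokens = corpusTokens + docLength;
        for (Map.Entry<String, Integer> e : articleCounts.entrySet()) {
            long termTotal = corpusCounts.getOrDefault(e.getKey(), 0) + e.getValue();
            double score = pmi(e.getValue(), termTotal, docLength, totalTokens);
            System.out.println(e.getKey() + " -> " + score);
        }
    }
}
```

Terms that are frequent in the article but rare in the background corpus get the highest scores, so sorting the article's terms by PMI gives candidate tags. In practice you would probably also filter stop words and very rare terms, since PMI tends to overrate terms with tiny counts.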
Source: https://stackoverflow.com/questions/14339290/how-to-extract-keywords-tags-from-text