tag generation from a text content

前端未结

关注

 5  1603

北荒 2020-11-29 15:56

I am curious if there is an algorithm/method exists to generate keywords/tags from a given text, by using some weight calculations, occurrence ratio or other tools.

5条回答

离开以前 (楼主)

2020-11-29 16:22
One way to do this would be to extract words that occur more frequently in a document than you would expect them to by chance. For example, say in a larger collection of documents the term 'Markov' is almost never seen. However, in a particular document from the same collection Markov shows up very frequently. This would suggest that Markov might be a good keyword or tag to associate with the document.

To identify keywords like this, you could use the point-wise mutual information of the keyword and the document. This is given by PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]. This will roughly tell you how much less (or more) surprised you are to come across the term in the specific document as appose to coming across it in the larger collection.

To identify the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.

If you want to extract multiword tags, see the StackOverflow question How to extract common / significant phrases from a series of text entries.

Borrowing from my answer to that question, the NLTK collocations how-to covers how to do extract interesting multiword expressions using n-gram PMI in a about 7 lines of code, e.g.:
```
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words(
   nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3) 

# return the 5 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 5)  
```
0 讨论(0)

查看其它5个回答
发布评论:

提交评论
- 加载中...