This is a Python and NLTK newbie question.
I want to find the frequency of bigrams that occur together more than 10 times and have the highest PMI.
For this,
Do go through the tutorial at http://nltk.googlecode.com/svn/trunk/doc/howto/collocations.html for more on NLTK's collocation functions, and the math at https://en.wikipedia.org/wiki/Pointwise_mutual_information. I hope the following script helps, since your question didn't specify the input.
# This is just a fancy way to create document.
# I assume you have your texts in a continuous string format
# where each sentence ends with a fullstop.
>>> from itertools import chain
>>> docs = ["this is a sentence", "this is a foo bar", "you are a foo bar", "yes , i am"]
>>> texts = list(chain(*[(doc + " .").split() for doc in docs]))
# This is the NLTK part
>>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
>>> bigram_measures = BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(texts)
# This gets the top 20 bigrams according to PMI
>>> finder.nbest(bigram_measures.pmi, 20)
[(',', 'i'), ('i', 'am'), ('yes', ','), ('you', 'are'), ('foo', 'bar'), ('this', 'is'), ('a', 'foo'), ('is', 'a'), ('a', 'sentence'), ('are', 'a'), ('bar', '.'), ('.', 'yes'), ('.', 'you'), ('am', '.'), ('sentence', '.'), ('.', 'this')]
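Since you also want bigrams that occur together more than 10 times, you can combine the PMI ranking with NLTK's frequency filter. A minimal sketch (the toy corpus here is tiny, so I use a threshold of 2 for illustration; use 10 on a real corpus):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = "this is a foo bar . this is a foo bar . you are a foo bar .".split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)

# Discard bigrams seen fewer than 2 times (use 10 for your case)
finder.apply_freq_filter(2)

# Rank the surviving bigrams by PMI, highest first
for bigram, score in finder.score_ngrams(bigram_measures.pmi):
    print(bigram, score)
```

`score_ngrams` returns (bigram, score) pairs sorted by score, so after the frequency filter you get exactly the frequent, high-PMI bigrams.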
PMI measures the association of two words by computing log( p(x|y) / p(x) ), so it's not only about how frequent a word is or how often a pair occurs together. To achieve high PMI, you need both a high p(x|y) and a low p(x).
Here are some extreme PMI examples.
Let's say you have 100 words in the corpus, and a certain word X occurs only once, and every time it occurs it is together with another word Y; then:
p(x|y) = 1
p(x) = 1/100
PMI = log(1 / (1/100)) = log(100) = 2   (using base-10 log)
Let's say you have 100 words in the corpus, and a certain word X occurs 90 times but never together with another word Y; then:
p(x|y) = 0
p(x) = 90/100
PMI = log(0 / (90/100)) = log(0) = -infinity
So in that sense, X and Y have a much higher PMI in the first scenario than in the second, even though the frequency of the word in the second scenario is very high.
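The two extreme scenarios can be checked numerically. A minimal sketch using base-10 logs, matching the hand computations above (the `pmi` helper here is just for illustration, not an NLTK function):

```python
import math

def pmi(p_x_given_y, p_x):
    """PMI = log( p(x|y) / p(x) ); -inf when the words never co-occur."""
    if p_x_given_y == 0:
        return float("-inf")
    return math.log10(p_x_given_y / p_x)

# Scenario 1: X occurs once in 100 words, always together with Y
print(pmi(1, 1 / 100))    # 2.0

# Scenario 2: X occurs 90 times in 100 words, never with Y
print(pmi(0, 90 / 100))   # -inf
```

Note that NLTK's `BigramAssocMeasures.pmi` uses base-2 logs, so the absolute scores differ from these, but the ranking of bigrams is the same.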