This is a Python and NLTK newbie question.
I want to find the frequency of bigrams that occur together more than 10 times and have the highest PMI.
For this,
Do go through the tutorial at http://nltk.googlecode.com/svn/trunk/doc/howto/collocations.html for more on NLTK's collocation functions, and the math at https://en.wikipedia.org/wiki/Pointwise_mutual_information. I hope the following script helps, since your question didn't specify the input.
# This is just a fancy way to create document.
# I assume you have your texts in a continuous string format
# where each sentence ends with a fullstop.
>>> from itertools import chain
>>> docs = ["this is a sentence", "this is a foo bar", "you are a foo bar", "yes , i am"]
>>> texts = list(chain(*[(doc + " .").split() for doc in docs]))
# This is the NLTK part
>>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
>>> bigram_measures = BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(texts)
# This gets the top 20 bigrams according to PMI
>>> finder.nbest(bigram_measures.pmi, 20)
[(',', 'i'), ('i', 'am'), ('yes', ','), ('you', 'are'), ('foo', 'bar'), ('this', 'is'), ('a', 'foo'), ('is', 'a'), ('a', 'sentence'), ('are', 'a'), ('bar', '.'), ('.', 'yes'), ('.', 'you'), ('am', '.'), ('sentence', '.'), ('.', 'this')]
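Since you also want bigrams that occur together more than 10 times, you can combine the PMI ranking with NLTK's frequency filter. A minimal sketch (the toy corpus here is tiny, so I use a threshold of 2 for illustration; use 10 on a real corpus):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = "this is a foo bar . this is a foo bar . you are a foo bar .".split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)

# Discard bigrams seen fewer than 2 times (use 10 for your case)
finder.apply_freq_filter(2)

# Rank the surviving bigrams by PMI, highest first
for bigram, score in finder.score_ngrams(bigram_measures.pmi):
    print(bigram, score)
```

`score_ngrams` returns (bigram, score) pairs sorted by score, so after the frequency filter you get exactly the frequent, high-PMI bigrams.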
PMI measures the association of two words by computing log( p(x|y) / p(x) ), so it's not only about how frequent a word is or how often a pair occurs together. To achieve high PMI, you need both a high p(x|y) and a low p(x).
Here are some extreme PMI examples.
Let's say you have 100 words in the corpus, and a certain word X occurs only once, and every time it occurs it is together with another word Y; then:
p(x|y) = 1
p(x) = 1/100
PMI = log(1 / (1/100)) = log(100) = 2   (using base-10 log)
Let's say you have 100 words in the corpus, and a certain word X occurs 90 times but never together with another word Y; then:
p(x|y) = 0
p(x) = 90/100
PMI = log(0 / (90/100)) = log(0) = -infinity
So in that sense, X and Y have a much higher PMI in the first scenario than in the second, even though the frequency of the word in the second scenario is very high.
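The two extreme scenarios can be checked numerically. A minimal sketch using base-10 logs, matching the hand computations above (the `pmi` helper here is just for illustration, not an NLTK function):

```python
import math

def pmi(p_x_given_y, p_x):
    """PMI = log( p(x|y) / p(x) ); -inf when the words never co-occur."""
    if p_x_given_y == 0:
        return float("-inf")
    return math.log10(p_x_given_y / p_x)

# Scenario 1: X occurs once in 100 words, always together with Y
print(pmi(1, 1 / 100))    # 2.0

# Scenario 2: X occurs 90 times in 100 words, never with Y
print(pmi(0, 90 / 100))   # -inf
```

Note that NLTK's `BigramAssocMeasures.pmi` uses base-2 logs, so the absolute scores differ from these, but the ranking of bigrams is the same.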