I'm doing this homework and I'm stuck at this point: how can I program bigram frequencies for English text, i.e. the 'conditional probability', in Python?
$$P(w_n \mid w_{n-1}) = \frac{P(w_{n-1}, w_n)}{P(w_{n-1})}$$

That is, the probability of a token $w_n$ given the preceding token $w_{n-1}$ is equal to the probability of their bigram (the co-occurrence of the two tokens, $P(w_{n-1}, w_n)$) divided by the probability of the preceding token, $P(w_{n-1})$.
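As a small worked illustration of this formula, here is a sketch on a toy string. The sample text and the names `bigrams`, `firsts` and `cond_prob` are made up for the example, not part of the homework. It assumes both probabilities are estimated from counts over the same set of adjacent pairs, so the shared total cancels and the ratio reduces to count(previous, current) / count(previous).

```python
from collections import Counter

text = "abac"  # hypothetical sample text

# Adjacent letter pairs: ('a','b'), ('b','a'), ('a','c')
bigrams = Counter(zip(text, text[1:]))
# Letters that have a successor, i.e. every position except the last: 'a', 'b', 'a'
firsts = Counter(text[:-1])

def cond_prob(current, previous):
    """Estimate P(current | previous) as count(previous, current) / count(previous)."""
    if firsts[previous] == 0:
        return 0.0  # `previous` never appears with a successor
    return bigrams[(previous, current)] / firsts[previous]

print(cond_prob("b", "a"))  # 0.5: 'a' is followed once by 'b' and once by 'c'
print(cond_prob("a", "b"))  # 1.0: 'b' is always followed by 'a'
```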
I have a text with many letters, and I have already calculated the probability of each letter in this text; for example, the letter 'a' makes up 0.015% of the letters in the text.
The only characters are the letters a-z and A-Z (`^a-zA-Z`), and what I want is:
How can I make a table of size (alphabet) x (alphabet), and how do I calculate the conditional probability for every single cell?
It's like:

    [[(a|a), (b|a), (c|a), ..., (z|a), ..., (Z|a)],
     [(a|b), (b|b), (c|b), ..., (z|b), ..., (Z|b)],
     ...
     [(a|Z), (b|Z), (c|Z), ..., (z|Z), ..., (Z|Z)]]

and for each cell I should calculate the probability, for example: what is the chance of getting the letter 'a' if the current letter is 'a', and so on.
I can't get started. I hope you can kickstart me, and I hope it's clear what I need to solve.
Assuming your file has no other punctuation (easy enough to strip out):
    import itertools
    import string

    ALPHABET = string.ascii_letters  # 'a'..'z' followed by 'A'..'Z' (52 letters)
    INDEX = {ch: i for i, ch in enumerate(ALPHABET)}  # letter -> row/column index

    def pairwise(s):
        """Yield overlapping pairs: (s0, s1), (s1, s2), ..."""
        a, b = itertools.tee(s)
        next(b, None)  # advance the second iterator by one; tolerate empty input
        return zip(a, b)

    n = len(ALPHABET)
    counts = [[0] * n for _ in range(n)]  # nothing has occurred yet

    with open('path/to/input') as infile:
        # get pairwise characters from the text (whitespace is dropped, so pairs span word boundaries)
        chars = (char for line in infile for word in line.split() for char in word)
        for a, b in pairwise(chars):
            given = INDEX[a]  # index (in `counts`) of the "given" character
            char = INDEX[b]   # index of the character that follows the "given" character
            counts[given][char] += 1

    # now that we have the number of occurrences, divide by the row totals to get conditional probabilities
    totals = [sum(row) for row in counts]
    for given in range(n):
        if not totals[given]:
            continue  # this letter never appeared as a "given" character
        for i in range(n):
            counts[given][i] /= totals[given]  # true division (Python 3): now P(i | given)
I haven't tested this, but it should be a good start.
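To read a value back out of a list-of-lists table like this, you need to map letters to row and column indices. A minimal sketch, assuming rows are the "given" letter, columns are the letter that follows, and both are ordered as in `string.ascii_letters`; the helper `prob` and the toy table are hypothetical and only there for illustration:

```python
import string

ALPHABET = string.ascii_letters  # 'abc...zABC...Z', 52 letters
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def prob(table, char, given):
    """Return P(char | given) from a 52x52 table indexed as table[given][char]."""
    return table[INDEX[given]][INDEX[char]]

# Toy table: pretend 'a' is always followed by 'Z'
table = [[0.0] * 52 for _ in range(52)]
table[INDEX["a"]][INDEX["Z"]] = 1.0

print(prob(table, "Z", "a"))  # 1.0
print(prob(table, "b", "a"))  # 0.0
```

Using a lookup dictionary instead of `ord()` arithmetic keeps upper- and lowercase letters in separate, non-negative slots.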
Here's a dictionary version, which should be easier to read and debug:
    counts = {}
    with open('path/to/input') as infile:
        chars = (char for line in infile for word in line.split() for char in word)
        for given, char in pairwise(chars):  # `pairwise` as defined above
            # key by the characters themselves, so lookups read like answer['b']['a']
            if given not in counts:
                counts[given] = {}
            if char not in counts[given]:
                counts[given][char] = 0
            counts[given][char] += 1

    answer = {}
    for given, chardict in counts.items():
        total = sum(chardict.values())  # how often `given` was followed by anything
        answer[given] = {}
        for char, count in chardict.items():
            answer[given][char] = count / total
Now, `answer` contains the probabilities you are after. If you want the probability of 'a' given 'b', look at `answer['b']['a']`.
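For a quick sanity check, the same idea can be run on a short in-memory string. This is only a sketch: it swaps the manual key checks for `collections.defaultdict` and `Counter`, and the function name `bigram_probs` and the sample string are made up for illustration. Like the file-based code, it drops whitespace, so pairs span word boundaries.

```python
from collections import Counter, defaultdict

def bigram_probs(text):
    """Return {given: {char: P(char | given)}} for adjacent ASCII letters in text."""
    letters = [c for c in text if c.isascii() and c.isalpha()]
    counts = defaultdict(Counter)
    for given, char in zip(letters, letters[1:]):
        counts[given][char] += 1
    return {
        given: {char: n / sum(followers.values()) for char, n in followers.items()}
        for given, followers in counts.items()
    }

answer = bigram_probs("banana band")
print(answer["a"]["n"])  # 0.75: 'a' is followed by 'n' three times out of four
print(answer["a"]["b"])  # 0.25
print(answer["b"])       # {'a': 1.0}
```

Each inner dictionary should sum to 1 (up to floating-point rounding), which is an easy way to check the result on real data.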
Source: https://stackoverflow.com/questions/27951634/how-do-i-program-bigram-as-a-table-in-python