How do I program a bigram as a table in Python?

I'm working on a homework assignment and I'm stuck at this point. I can't figure out how to program bigram frequency for the English language, i.e. the 'conditional probability', in Python.

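That is, the probability P(w_n | w_{n-1}) of a token w_n given the preceding token w_{n-1} is equal to the probability of their bigram, or the co-occurrence of the two tokens, P(w_{n-1}, w_n), divided by the probability of the preceding token, P(w_{n-1}):

P(w_n | w_{n-1}) = P(w_{n-1}, w_n) / P(w_{n-1})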
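As a worked illustration of this definition, here is a minimal sketch on a toy string. The names `text`, `bigrams`, `firsts` and `conditional` are hypothetical, and it assumes the probability of the preceding token is estimated only from positions that actually have a successor.

```python
from collections import Counter

text = "abac"  # toy input, chosen only for illustration

# Count every pair of adjacent characters: ab, ba, ac.
bigrams = Counter(zip(text, text[1:]))
# Count every character that has a successor (all but the last one).
firsts = Counter(text[:-1])

n_bigrams = len(text) - 1  # number of adjacent pairs in the text


def conditional(char, given):
    """Estimate P(char | given) as P(given, char) / P(given)."""
    p_bigram = bigrams[(given, char)] / n_bigrams
    p_given = firsts[given] / n_bigrams
    return p_bigram / p_given if p_given else 0.0


print(conditional('b', 'a'))  # 'a' is followed once by 'b' and once by 'c'
print(conditional('a', 'b'))  # 'b' is always followed by 'a'
```

Because both probabilities share the same denominator, the ratio reduces to count(given, char) / count(given), which is why code for this usually works with raw counts.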

I have a text with many letters, and I have already calculated the probability of each letter in this text; for example, the letter 'a' makes up 0.015% of the letters in the text.

The letters are from ^a-zA-Z, and what I want is:
How can I make a table whose dimensions are the size of the alphabet ((alphabet) x (alphabet)), and how do I calculate the conditional probability for every single cell?

It's like:

[[(a|a),(b|a),(c|a),...,(z|a),...,(Z|a)]
 [(a|b),(b|b),(c|b),...,(z|b),...,(Z|b)]
 ...
 [(a|Z),(b|Z),(c|Z),...,(z|Z),...,(Z|Z)]]

and for each cell I should calculate the probability, e.g.: what is the chance of getting the letter 'a' if the letter you have at this point is 'a', and so on.
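One possible way to lay out such a table is sketched below, assuming the alphabet is exactly the 52 letters a-z followed by A-Z. The names `ALPHABET`, `INDEX`, `empty_table` and `cell` are hypothetical helpers introduced only for this illustration, and the value 0.25 is a placeholder rather than a real measurement.

```python
import string

ALPHABET = string.ascii_letters  # 'a'..'z' followed by 'A'..'Z', 52 symbols
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}  # letter -> row/column number


def empty_table():
    """Build an (alphabet) x (alphabet) table filled with zeros."""
    return [[0.0] * len(ALPHABET) for _ in ALPHABET]


def cell(table, char, given):
    """Read the entry (char | given): the row is `given`, the column is `char`."""
    return table[INDEX[given]][INDEX[char]]


table = empty_table()
table[INDEX['a']][INDEX['b']] = 0.25  # placeholder value for (b|a)

print(len(table), len(table[0]))  # 52 52
print(cell(table, 'b', 'a'))      # 0.25
```

Keeping the letter-to-index mapping in one place means the same lookup works for both lowercase and uppercase letters.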

I can't get started. I hope you can kickstart me, and I hope it's clear what I need to solve.

Assuming your file has no other punctuation (easy enough to strip out):

    import itertools
    import string

    letters = string.ascii_letters  # 'a'..'z' then 'A'..'Z': 52 symbols

    def pairwise(s):
        """Yield consecutive pairs: (s0, s1), (s1, s2), ..."""
        a, b = itertools.tee(s)
        next(b, None)
        return zip(a, b)

    counts = [[0 for _ in range(52)] for _ in range(52)]  # nothing has occurred yet
    with open('path/to/input') as infile:
        # get pairwise characters from the text (whitespace is skipped)
        chars = (char for line in infile for word in line.split() for char in word)
        for a, b in pairwise(chars):
            # ord(x) - ord('a') would give negative indices for uppercase letters,
            # so look the position up in the 52-letter alphabet instead
            given = letters.index(a)  # index (in `counts`) of the "given" character
            char = letters.index(b)   # index of the character that follows the "given" character
            counts[given][char] += 1

    # now that we have the number of occurrences, let's divide by the totals to get conditional probabilities
    totals = [sum(row) for row in counts]
    for given in range(52):
        if not totals[given]:
            continue  # this letter never appeared as a "given" character
        for i in range(len(counts[given])):
            counts[given][i] /= totals[given]

I haven't tested this, but it should be a good start
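Since the code is untested, one cheap sanity check is that every row of the finished table is a probability distribution. The sketch below uses a hypothetical two-letter alphabet and the made-up helper names `normalize_rows` and `rows_are_distributions`; it illustrates the check and is not part of the original answer.

```python
import math


def normalize_rows(counts):
    """Return a new table where each non-empty row is divided by its total."""
    table = []
    for row in counts:
        total = sum(row)
        table.append([c / total for c in row] if total else [0.0] * len(row))
    return table


def rows_are_distributions(table):
    """True if every row sums to 1, or to 0 for letters never seen as 'given'."""
    return all(math.isclose(sum(row), 1.0) or sum(row) == 0 for row in table)


toy_counts = [[1, 3], [0, 0]]  # hypothetical counts for a 2-letter alphabet
toy_table = normalize_rows(toy_counts)

print(toy_table)                          # [[0.25, 0.75], [0.0, 0.0]]
print(rows_are_distributions(toy_table))  # True
```

Running the same check on the 52 x 52 table would flag indexing mistakes that put counts in the wrong row.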

Here's a dictionary version, which should be easier to read and debug:

    counts = {}
    with open('path/to/input') as infile:
        chars = (char for line in infile for word in line.split() for char in word)
        for given, char in pairwise(chars):  # pairwise() as defined above
            # use the characters themselves as keys, so lookups read answer['b']['a']
            if given not in counts:
                counts[given] = {}
            if char not in counts[given]:
                counts[given][char] = 0
            counts[given][char] += 1

    answer = {}
    for given, chardict in counts.items():  # iterate over the counts, not the empty answer
        total = sum(chardict.values())
        answer[given] = {}
        for char, count in chardict.items():
            answer[given][char] = count / total

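Now, answer contains the probabilities you are after. If you want the probability of 'a', given 'b', look at answer['b']['a']. A pair that never occurs in the text has no entry in answer, so treat a missing key as probability 0.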
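For comparison, here is a more compact sketch of the same dictionary idea using `collections.Counter` and `collections.defaultdict`, run on an in-memory string so it is easy to try out. The function name `bigram_probabilities` and the sample text are hypothetical, and it assumes that every character outside a-zA-Z is simply dropped, so pairs can span word boundaries.

```python
import string
from collections import Counter, defaultdict

LETTERS = set(string.ascii_letters)


def bigram_probabilities(text):
    """Map given -> {char: P(char | given)} for consecutive letters in text."""
    letters = [ch for ch in text if ch in LETTERS]  # assumption: keep a-zA-Z only
    counts = defaultdict(Counter)
    for given, char in zip(letters, letters[1:]):
        counts[given][char] += 1
    return {
        given: {char: n / sum(followers.values()) for char, n in followers.items()}
        for given, followers in counts.items()
    }


answer = bigram_probabilities("banana band")

print(answer['a']['n'])           # 0.75: 'a' is followed by 'n' three times out of four
print(answer['a']['b'])           # 0.25
print(answer['n'].get('b', 0.0))  # 0.0: 'n' is never followed by 'b'
```

Using `.get(char, 0.0)` for the inner lookup avoids a KeyError for pairs that never occur.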

Source: https://stackoverflow.com/questions/27951634/how-do-i-program-bigram-as-a-table-in-python

Tags: python, list, dictionary, markov-chains