How do I program a bigram as a table in Python?

Submitted by 无人久伴 on 2019-12-02 07:59:28

Assuming your file has no other punctuation (easy enough to strip out):

    import itertools
    import string

    # 52 symbols: a-z followed by A-Z; maps each letter to its row/column in `counts`
    INDEX = {ch: i for i, ch in enumerate(string.ascii_letters)}

    def pairwise(s):
        a, b = itertools.tee(s)
        next(b, None)  # advance the second iterator by one (safe on empty input)
        return zip(a, b)

    counts = [[0 for _ in range(52)] for _ in range(52)]  # nothing has occurred yet
    with open('path/to/input') as infile:
        # get pairwise characters from the text (pairs also span word boundaries)
        for a, b in pairwise(char for line in infile for word in line.split() for char in word):
            given = INDEX[a]  # index (in `counts`) of the "given" character
            char = INDEX[b]   # index of the character that follows the "given" character
            counts[given][char] += 1

    # now that we have the number of occurrences, let's divide by the totals to get conditional probabilities
    totals = [sum(row) for row in counts]
    for given in range(52):
        if not totals[given]:
            continue  # this character never appeared as a "given" character
        for i in range(len(counts[given])):
            counts[given][i] /= totals[given]

I haven't tested this, but it should be a good start.
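The punctuation stripping mentioned at the top is not shown, so here is a minimal sketch of one way to do it. The helper name `letters_only` is made up for illustration, and it assumes that only ASCII letters should be counted (digits, punctuation and whitespace are all dropped):

```python
import string

# Assumption: only the 52 ASCII letters are of interest.
LETTERS = set(string.ascii_letters)

def letters_only(text):
    """Yield the ASCII letters of `text`, skipping everything else."""
    return (ch for ch in text if ch in LETTERS)

print("".join(letters_only("Hello, world! 42")))  # Helloworld
```

Something like `pairwise(ch for line in infile for ch in letters_only(line))` could then stand in for the nested generator expression, so that stray characters never reach the table lookup.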

Here's a dictionary version, which should be easier to read and debug:

    counts = {}
    with open('path/to/input') as infile:
        # pairwise() is the helper defined above
        for a, b in pairwise(char for line in infile for word in line.split() for char in word):
            # key by the characters themselves so that answer['b']['a'] works
            if a not in counts:
                counts[a] = {}
            if b not in counts[a]:
                counts[a][b] = 0
            counts[a][b] += 1

    answer = {}
    for given, chardict in counts.items():
        total = sum(chardict.values())  # how often `given` was followed by anything
        answer[given] = {}
        for char, count in chardict.items():
            answer[given][char] = count / total

Now, answer contains the probabilities you are after. If you want the probability of 'a', given 'b' (that is, the probability that 'a' follows 'b'), look at answer['b']['a'].
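For a quick check without a file on disk, the same idea can be sketched with `collections.Counter` and `defaultdict`. This is not the code above, just a compact variant under a few assumptions: the function name `bigram_probabilities` is hypothetical, the input is an in-memory string, and `str.isalpha` decides what counts as a letter (so non-ASCII letters are kept as well):

```python
from collections import Counter, defaultdict
from itertools import tee

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one
    return zip(a, b)

def bigram_probabilities(text):
    """Return {given: {following: P(following | given)}} for the letters in `text`."""
    letters = [ch for ch in text if ch.isalpha()]  # assumption: drop non-letters
    counts = defaultdict(Counter)
    for given, following in pairwise(letters):
        counts[given][following] += 1
    answer = {}
    for given, followers in counts.items():
        total = sum(followers.values())
        answer[given] = {ch: n / total for ch, n in followers.items()}
    return answer

probs = bigram_probabilities("abab ac")
print(probs["a"])                # 'b' follows 'a' 2 times out of 3, 'c' once
print(probs["b"].get("z", 0.0))  # 0.0, since 'z' never follows 'b'
```

Using `.get(char, 0.0)` for the inner lookup avoids a `KeyError` for pairs that never occurred. The outer lookup still raises for a character that never appeared as a "given" character.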
