language-model

How to search a very big ARPA file in a very short time in Java

情到浓时终转凉″ submitted on 2019-12-02 01:55:30
Question: I have an ARPA file which is almost 1 GB, and I have to search it in less than 1 minute. I have searched a lot, but I have not found a suitable answer yet. I think I do not have to read the whole file: I just have to jump to a specific line in the file and read that whole line. The lines of an ARPA file do not all have the same length. I should mention that ARPA files have a specific format.

File format:

\data\
ngram 1=19
ngram 2=234
ngram 3=1013

\1-grams:
-1.7132 puluh -3.8008
-1.9782
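Since the lines have different lengths, you cannot seek to line N directly, but you can pay for one full scan up front and then answer every later lookup with a single seek. Below is a minimal sketch of that idea; it is in Python rather than the Java the question asks for (the same pattern maps directly onto Java's RandomAccessFile), and the file name and the tab-separated ARPA layout are assumptions:

index = {}  # n-gram string -> byte offset of its line in the file
offset = 0
with open('model.arpa', 'rb') as f:
    for line in f:
        # ARPA data lines look like: logprob<TAB>n-gram[<TAB>backoff]
        parts = line.decode('utf-8', 'replace').rstrip('\r\n').split('\t')
        if len(parts) >= 2:
            index[parts[1]] = offset
        offset += len(line)

# each later lookup is one seek + one readline instead of a full scan
with open('model.arpa', 'rb') as f:
    f.seek(index['puluh'])
    print(f.readline().decode('utf-8').rstrip())

The index costs one pass over the 1 GB file; after that, any entry (here the unigram "puluh" from the question's sample) is retrieved in effectively constant time.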

NLTK package to estimate the (unigram) perplexity

为君一笑 submitted on 2019-11-29 14:32:12
Question: I am trying to calculate the perplexity for the data I have. The code I am using is:

import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm

But I am receiving the error: File "/usr/local
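The nltk.model.NgramModel class this code imports was removed in NLTK 3; its replacement is the nltk.lm module. Below is a minimal sketch of a unigram perplexity computation with nltk.lm (assuming NLTK >= 3.4; the smoothing choice and the test words are illustrative, not from the question):

from nltk.corpus import brown
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 1  # a unigram model, as in the question's title
train, vocab = padded_everygram_pipeline(n, brown.sents(categories='news'))

lm = Laplace(n)  # add-one smoothing, so unseen words keep nonzero probability
lm.fit(train, vocab)

# perplexity() expects an iterable of n-gram tuples; unigrams are 1-tuples
test = [(w,) for w in ['the', 'jury', 'said']]
print(lm.perplexity(test))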

How to compute skipgrams in python?

̄綄美尐妖づ submitted on 2019-11-28 06:24:51
A k-skipgram is a generalization of an n-gram that allows up to k tokens to be skipped between the words, so the set of k-skipgrams is a superset of the plain n-grams together with every (k-i)-skipgram down to (k-i) == 0 (the 0-skip case, i.e., the ordinary n-grams). So how can these skipgrams be computed efficiently in Python? Below is the code I tried, but it does not do what I expect:

input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_skipgrams(input_list, N, K):
    bigram_list = []
    nlist = []
    K = 1
    for k in range(K + 1):
        for i in range(len(input_list) - 1):
            if i + k + 1 < len(input_list):
                nlist = []
                for j in range(N + 1):
                    if i + k + j + 1 < len(input_list):
                        nlist.append(input_list[i + k + j + 1])
            bigram_list.append(nlist)
    return bigram_list
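For reference, NLTK ships a ready-made implementation of exactly this; a minimal sketch using nltk.util.skipgrams (available in NLTK 3, with signature skipgrams(sequence, n, k); the sentence is the question's own example):

from nltk.util import skipgrams

sent = ['all', 'this', 'happened', 'more', 'or', 'less']

# 2-skip-bi-grams: pairs of words (n=2) with at most k=2 tokens skipped
print(list(skipgrams(sent, 2, 2)))

The output starts with ('all', 'this'), ('all', 'happened'), ('all', 'more'), and, because 0 skips are allowed, it includes every ordinary bigram as well.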

Building an OpenEars-compatible language model

假装没事ソ submitted on 2019-11-27 18:24:23
I am doing some development on speech to text and text to speech, and I found the OpenEars API very useful. This CMU-SLM-based API works by using a language model to map the speech picked up by the iPhone device to text. So I decided to find a big English language model to feed the API's speech recognizer engine. But I failed to understand the format of the VoxForge English data model for use with OpenEars. Does anyone have any idea how I can get the .languagemodel and .dic files for English to work with OpenEars?

Halle: Old question, but maybe the answer is still interesting.
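For context on what OpenEars expects: the .languagemodel file is a standard ARPA-format n-gram model, and the .dic file is a CMU-style pronunciation dictionary with one word per line followed by its phones. As a rough illustration of the two layouts (this is not OpenEars' own validation, and the file names are placeholders), here is a small Python sketch that sanity-checks a candidate pair of files:

def looks_like_arpa(path):
    # an ARPA model begins with a \data\ header listing the n-gram counts
    with open(path, encoding='utf-8', errors='replace') as f:
        for line in f:
            if line.strip():
                return line.strip() == '\\data\\'
    return False

def looks_like_cmu_dict(path):
    # each entry is a word followed by its phones, e.g. "HELLO HH AH L OW"
    with open(path, encoding='utf-8', errors='replace') as f:
        for line in f:
            if line.strip() and len(line.split()) < 2:
                return False
    return True

print(looks_like_arpa('english.languagemodel'))
print(looks_like_cmu_dict('english.dic'))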
