n-gram

How to extract character ngram from sentences? - python

橙三吉。 submitted on 2019-12-03 15:45:40
The following word2ngrams function extracts character 3-grams from a word: >>> x = 'foobar' >>> n = 3 >>> [x[i:i+n] for i in range(len(x)-n+1)] gives ['foo', 'oob', 'oba', 'bar']. This post shows character n-gram extraction for a single word: Quick implementation of character n-grams using python. But what if I have sentences and I want to extract the character n-grams? Is there a faster method than calling word2ngram() iteratively? What would a regex version that achieves the same word2ngram and sent2ngram output look like, and would it be faster? I've tried: import string, random, time from
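A minimal stdlib sketch of the question's setup: `word2ngrams` is the function from the excerpt, and `sent2ngrams` (the name the asker uses) is assumed here to mean "character n-grams of each word in the sentence, flattened":

```python
def word2ngrams(text, n=3):
    """Character n-grams of a single word."""
    return [text[i:i+n] for i in range(len(text) - n + 1)]

def sent2ngrams(sentence, n=3):
    """Character n-grams for every whitespace-separated word, flattened."""
    return [ng for word in sentence.split() for ng in word2ngrams(word, n)]

print(sent2ngrams("foo bar foobar"))
# → ['foo', 'bar', 'foo', 'oob', 'oba', 'bar']
```

The list comprehension avoids an explicit per-word loop but does the same work; any speedup over calling `word2ngrams` in a loop is marginal.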

How to generate n-grams in scala?

拥有回忆 submitted on 2019-12-03 13:10:08
Question: I am trying to implement the dissociated press algorithm, based on n-grams, in Scala. How do I generate n-grams for a large file? For example, for a file containing "the bee is the bee of the bees": first it has to pick a random n-gram, e.g. "the bee". Then it has to look for n-grams starting with the last (n-1) words, e.g. "bee of", and print the last word of that n-gram. Then it repeats. Can you please give me some hints on how to do this? Sorry for the inconvenience. Answer 1: Your questions could be a

Can Drupal's search module search for a substring? (Partial Search)

大憨熊 submitted on 2019-12-03 12:40:13
Drupal's core search module only searches for whole keywords, e.g. "sandwich". Can I make it search with a substring, e.g. "sandw", and return my sandwich results? Maybe there is a plugin that does that? The most direct module for it is probably Fuzzy Search; I have not tried it. If you have more advanced search needs on a small to medium sized site, Search Lucene API is a fine solution. For a larger site, or truly advanced needs, Solr is the premier solution. Recently I made a patch for Drupal's core search module to provide it with partial search (a.k.a. n-gram search) ability. This is tested
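To see why n-gram indexing enables this kind of partial search, here is a toy Python sketch (not Drupal code, just the underlying idea): index each word by its character trigrams, then answer a substring query by intersecting the posting sets of the query's trigrams:

```python
from collections import defaultdict

def trigrams(text, n=3):
    """Set of character n-grams of a string."""
    return {text[i:i+n] for i in range(len(text) - n + 1)}

# Inverted index: character trigram -> words containing it.
index = defaultdict(set)
for word in ["sandwich", "sandbox", "witchcraft"]:
    for tg in trigrams(word):
        index[tg].add(word)

def partial_search(query):
    """Words whose trigrams cover every trigram of the query."""
    candidate_sets = [index[tg] for tg in trigrams(query)]
    return set.intersection(*candidate_sets) if candidate_sets else set()

print(partial_search("sandw"))
# → {'sandwich'}
```

Note this is an approximation: a word containing all the query's trigrams in a different order would be a false positive, so real n-gram search backends verify candidates against the original text.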

Generate bigrams with NLTK

﹥>﹥吖頭↗ submitted on 2019-12-03 10:53:39
I am trying to produce a bigram list for a given sentence. For example, if I type "To be or not to be", I want the program to generate: to be, be or, or not, not to, to be. I tried the following code, but it just gives me <generator object bigrams at 0x0000000009231360>. This is my code: import nltk; bigrm = nltk.bigrams(text); print(bigrm). So how do I get what I want? I want a list of combinations of the words, as above (to be, be or, or not, not to, to be). nltk.bigrams() returns an iterator (specifically, a generator) of bigrams. If you want a list, pass the iterator to list(). It also expects a
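The fix in the answer is `list(nltk.bigrams(words))`. A dependency-free sketch of the same behaviour, pairing each word with its successor via `zip` (which is essentially what `nltk.bigrams` does lazily):

```python
words = "To be or not to be".lower().split()

# nltk.bigrams(words) yields consecutive pairs lazily; zip(words, words[1:])
# produces the same pairs without NLTK.
bigrams = list(zip(words, words[1:]))
print([" ".join(pair) for pair in bigrams])
# → ['to be', 'be or', 'or not', 'not to', 'to be']
```

Note that `nltk.bigrams` expects a sequence of tokens, not a raw string — passing a string would yield character pairs.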

R and tm package: create a term-document matrix with a dictionary of one or two words?

本秂侑毒 submitted on 2019-12-03 08:55:08
Purpose: I want to create a term-document matrix using a dictionary that has compound words, or bigrams, as some of the keywords. Web search: being new to text mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found: FAQs on the tm-package website; finding 2- and 3-word phrases using the R tm package; counting n-grams with the tm package in R; findAssocs for multiple terms in R. Background: of these, I preferred the solution that uses NGramTokenizer in the RWeka package in R, but I ran into a problem. In the example code below, I create

Extract keyphrases from text (1-4 word ngrams)

早过忘川 submitted on 2019-12-03 04:35:21
Question: What's the best way to extract keyphrases from a block of text? I'm writing a tool to do keyword extraction: something like this. I've found a few libraries for Python and Perl that extract n-grams, but I'm writing this in Node, so I need a JavaScript solution. If there aren't any existing JavaScript libraries, could someone explain how to do this so I can just write it myself? Answer 1: I like the idea, so I've implemented it: see below (descriptive comments are included). Preview at: http://fiddle
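The question asks for JavaScript, but the core algorithm is language-agnostic: enumerate every 1- to 4-word n-gram and count occurrences, then rank the counts. A hedged Python sketch of that candidate-generation step (the function name and scoring are illustrative, not from the answer):

```python
import re
from collections import Counter

def keyphrase_candidates(text, max_n=4):
    """Count every 1..max_n word n-gram as a keyphrase candidate."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

counts = keyphrase_candidates("the quick fox and the quick dog")
print(counts.most_common(3))
```

Real keyphrase extractors then filter these candidates (e.g. drop phrases that start or end with stopwords) rather than ranking by raw frequency alone.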

Understanding the `ngram_range` argument in a CountVectorizer in sklearn

十年热恋 submitted on 2019-12-03 04:14:54
Question: I'm a little confused about how to use n-grams in the scikit-learn library in Python — specifically, how the ngram_range argument works in a CountVectorizer. Running this code: from sklearn.feature_extraction.text import CountVectorizer; vocabulary = ['hi ', 'bye', 'run away']; cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2)); print cv.vocabulary_ gives me: {'hi ': 0, 'bye': 1, 'run away': 2}. I was under the (obviously mistaken) impression that I would get unigrams and bigrams
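The key point is that when an explicit `vocabulary` is supplied, CountVectorizer's `vocabulary_` is exactly that dictionary; `ngram_range` only controls which n-grams are generated from the input documents and then matched against it. To see which tokens `ngram_range=(1, 2)` actually produces, here is a dependency-free sketch approximating the default word analyzer:

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Roughly what CountVectorizer's word analyzer emits for a document."""
    words = text.lower().split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams

print(word_ngrams("hi bye run away"))
# → ['hi', 'bye', 'run', 'away', 'hi bye', 'bye run', 'run away']
```

This is a simplification: the real analyzer uses a regex token pattern and preprocessing, but it shows why `vocabulary_` is not where the generated unigrams and bigrams appear when a fixed vocabulary is passed.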

Creating ARPA language model file with 50,000 words

强颜欢笑 submitted on 2019-12-03 03:54:52
Question: I want to create an ARPA language model file with nearly 50,000 words. I can't generate the language model by passing my text file to the CMU Language Tool. Is there any other link available where I can get a language model for this many words? Answer 1: I thought I'd answer this one since it has a few votes, although, based on Christina's other questions, I don't think this will be a usable answer for her, since a 50,000-word language model almost certainly won't have an acceptable word error rate or
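For context, an ARPA file is a plain-text format: a \data\ header listing n-gram counts, then one \n-grams: section per order with lines of the form "log10-probability  n-gram  [log10-backoff-weight]", closed by \end\. A toy illustration (the probabilities below are made-up numbers, not from any real model):

```
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0000 </s>
-0.3979 the	-0.2218
-0.6990 bee	-0.1761

\2-grams:
-0.3010 the bee
-0.6021 bee </s>

\end\
```

Toolkits such as SRILM or KenLM can build files in this format from a text corpus, which is the usual route when a web tool caps the vocabulary size.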

How to generate n-grams in scala?

筅森魡賤 submitted on 2019-12-03 03:18:37
I am trying to implement the dissociated press algorithm, based on n-grams, in Scala. How do I generate n-grams for a large file? For example, for a file containing "the bee is the bee of the bees": first it has to pick a random n-gram, e.g. "the bee". Then it has to look for n-grams starting with the last (n-1) words, e.g. "bee of", and print the last word of that n-gram. Then it repeats. Can you please give me some hints on how to do this? Sorry for the inconvenience. Your questions could be a little more specific, but here is my try: val words = "the bee is the bee of the bees"; words.split(' ').sliding(2)
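The answer's `words.split(' ').sliding(2)` enumerates consecutive word pairs, which is the raw material for the walk the question describes. A hedged end-to-end sketch in Python (the document's dominant language; names like `dissociated_press` are illustrative): build a bigram successor table, then repeatedly emit a random successor of the last word:

```python
import random
from collections import defaultdict

text = "the bee is the bee of the bees"
words = text.split()

# Successor table: word -> list of words that follow it, mirroring what
# Scala's words.sliding(2) enumerates pair by pair.
successors = defaultdict(list)
for prev, nxt in zip(words, words[1:]):
    successors[prev].append(nxt)

def dissociated_press(start, length, seed=0):
    """Random walk over the bigram table, dissociated-press style."""
    random.seed(seed)
    out = [start]
    for _ in range(length - 1):
        nxts = successors.get(out[-1])
        if not nxts:
            break  # dead end: the last word never appears mid-sentence
        out.append(random.choice(nxts))
    return " ".join(out)

print(dissociated_press("the", 6))
```

For larger n, the same table keyed on (n-1)-word tuples instead of single words gives the general n-gram version.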

How to get n-gram collocations and association in python nltk?

Anonymous (unverified), submitted on 2019-12-03 02:50:02
Question: In this documentation, there is an example using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder. There is an example method that finds the n best collocations based on PMI for bigrams and trigrams. Example: finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt')); >>> finder.nbest(bigram_measures.pmi, 10). I know that BigramCollocationFinder and TrigramCollocationFinder inherit from AbstractCollocationFinder, while BigramAssocMeasures() and
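The PMI score the finder ranks by is just log2(p(x,y) / (p(x)p(y))). A dependency-free sketch of that computation from raw counts (a simplified version — NLTK's implementation normalizes counts slightly differently, so exact scores may not match):

```python
import math
from collections import Counter

words = "the bee is the bee of the bees".split()

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
n = len(words)

def pmi(w1, w2):
    """Pointwise mutual information of a bigram, log base 2."""
    p_xy = bigrams[(w1, w2)] / (n - 1)
    p_x = unigrams[w1] / n
    p_y = unigrams[w2] / n
    return math.log2(p_xy / (p_x * p_y))

# Rank bigrams by PMI, highest first — what finder.nbest(pmi, k) does.
ranked = sorted(bigrams, key=lambda b: pmi(*b), reverse=True)
print(ranked[:3])
```

Note the well-known caveat that PMI favors rare pairs, which is why the NLTK example usually combines it with a minimum-frequency filter (apply_freq_filter).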