n-gram

Converting a list of tokens to n-grams

夙愿已清 · submitted on 2019-12-01 11:10:41
I have a list of documents that have already been tokenized:

dat <- list(
  c("texaco", "canada", "lowered", "contract", "price", "pay", "crude", "oil",
    "canadian", "cts", "barrel", "effective", "decrease", "brings", "companys",
    "posted", "price", "benchmark", "grade", "edmonton", "swann", "hills",
    "light", "sweet", "canadian", "dlrs", "bbl", "texaco", "canada", "changed",
    "crude", "oil", "postings", "feb", "reuter"),
  c("argentine", "crude", "oil", "production", "pct", "january", "mln",
    "barrels", "mln", "barrels", "january", "yacimientos", "petroliferos",
    "fiscales", "january", "natural", "gas",
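The question's data is in R, but the transformation itself is language-agnostic: slide a window of size n over each token vector. A minimal sketch in Python (not from the original thread):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token list."""
    # zip over n staggered copies of the list; zip stops at the shortest
    return list(zip(*(tokens[i:] for i in range(n))))

doc = ["texaco", "canada", "lowered", "contract", "price"]
bigrams = ngrams(doc, 2)
# → [("texaco", "canada"), ("canada", "lowered"),
#    ("lowered", "contract"), ("contract", "price")]
```

Applying the same function per document with a list comprehension mirrors the R list-of-vectors structure.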

Quicker way to detect n-grams in a string?

梦想的初衷 · submitted on 2019-12-01 01:57:31
I found this solution on SO to detect n-grams in a string (here: N-gram generation from a sentence):

import java.util.*;

public class Test {
    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i + n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static
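The Java approach above is a straightforward sliding window over the split words. The same idea in Python (a sketch, not part of the original answer) is one comprehension, which avoids the explicit StringBuilder loop:

```python
def ngrams(n, text):
    """Word-level n-grams of a whitespace-split string, joined back with spaces."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

ngrams(2, "this is a test")
# → ["this is", "is a", "a test"]
```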

Bytes vs Characters vs Words - which granularity for n-grams?

北慕城南 · submitted on 2019-11-30 20:34:37
At least 3 types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs". Are there other criteria to consider for choosing the "right" representation? Evaluate. The criterion for choosing the representation is whatever works. Indeed, character level (!=
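The typo-robustness claim is easy to check concretely. A small sketch (my own illustration, not from the thread) comparing Jaccard similarity over character trigrams versus whole words:

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

clean = "mary loves dogs"
typo = "mary lpves dogs"

char_sim = jaccard(char_ngrams(clean), char_ngrams(typo))
word_sim = jaccard(set(clean.split()), set(typo.split()))
# A single-character typo breaks only the 3 trigrams that overlap it,
# so char_sim stays well above word_sim, where the whole word is lost.
```

Here char_sim is 0.625 (10 of 16 distinct trigrams survive) while word_sim drops to 0.5, matching the intuition in the question.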

Java Lucene NGramTokenizer

◇◆丶佛笑我妖孽 · submitted on 2019-11-30 17:39:59
I am trying to tokenize strings into ngrams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual ngrams that were tokenized. In fact I only see two methods in the NGramTokenizer class that return String objects. Here is the code that I have:

Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);

Where are the ngrams that were tokenized? How can I get the output in Strings/Words? I want my output to be like: This, is, a, test, string, This is, is a, a test, test string, This
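Worth noting: Lucene's NGramTokenizer emits character n-grams, while the output the asker wants is word n-grams (shingles), which is what Lucene's ShingleFilter produces. The desired output itself is easy to sketch outside Lucene; in Python (my illustration, not a Lucene API):

```python
def word_shingles(text, min_n=1, max_n=3):
    """All word n-grams of sizes min_n..max_n, grouped by size."""
    words = text.split()
    return [" ".join(words[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(words) - n + 1)]

word_shingles("This is a test string")
# Yields the 5 unigrams, then "This is", "is a", "a test",
# "test string", then the three trigrams.
```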

Python: populate a shelve object/dictionary with multiple keys

三世轮回 · submitted on 2019-11-30 16:25:56
Question: I have a list of 4-grams that I want to populate a dictionary/shelve object with:

['I', 'go', 'to', 'work']
['I', 'go', 'there', 'often']
['it', 'is', 'nice', 'being']
['I', 'live', 'in', 'NY']
['I', 'go', 'to', 'work']

So that we have something like four_grams['I']['go']['to']['work'] = 1, where any newly encountered 4-gram is populated with its four keys with the value 1, and the value is incremented if the same 4-gram is encountered again.

Answer 1: You could do something like this: import shelve from collections
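The answer is truncated above; one common way to get the nested-key behaviour (a sketch of the general idea, not necessarily what the original answer did) is a recursive defaultdict:

```python
from collections import defaultdict

def nested():
    # Each missing key auto-creates another nested dict level
    return defaultdict(nested)

counts = nested()
grams = [
    ["I", "go", "to", "work"],
    ["I", "go", "there", "often"],
    ["it", "is", "nice", "being"],
    ["I", "live", "in", "NY"],
    ["I", "go", "to", "work"],
]
for w1, w2, w3, w4 in grams:
    node = counts[w1][w2][w3]
    node[w4] = node.get(w4, 0) + 1  # 1 on first sight, incremented after
```

Note that shelve keys must be plain strings, so for an actual shelve you would flatten the four words into one key (e.g. " ".join(gram)) rather than nesting.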

How to generate bi/tri-grams using spacy/nltk

狂风中的少年 · submitted on 2019-11-30 13:26:53
Question: The input texts are always dish names with 1~3 adjectives and a noun.

Inputs:
thai iced tea
spicy fried chicken
sweet chili pork
thai chicken curry

Outputs:
thai tea, iced tea
spicy chicken, fried chicken
sweet pork, chili pork
thai chicken, chicken curry, thai curry

Basically, I am looking to parse the sentence tree and try to generate bi-grams by pairing an adjective with the noun. And I would like to achieve this with spacy or nltk.

Answer 1: I used spacy 2.0 with the English model.
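Once POS tags are available (spacy's token.pos_ or nltk.pos_tag would supply them in practice), the pairing step is simple. A pure-Python sketch of that step, with the tags assumed as input rather than computed:

```python
def dish_bigrams(tagged):
    """Pair each adjective with every later noun, plus adjacent noun-noun pairs.

    tagged: list of (word, tag) with tags "ADJ" or "NOUN" (assumed given).
    """
    out = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "ADJ":
            out += [f"{word} {w2}" for w2, t2 in tagged[i + 1:] if t2 == "NOUN"]
        elif tag == "NOUN" and i + 1 < len(tagged) and tagged[i + 1][1] == "NOUN":
            out.append(f"{word} {tagged[i + 1][0]}")
    return out

dish_bigrams([("thai", "ADJ"), ("iced", "ADJ"), ("tea", "NOUN")])
# → ["thai tea", "iced tea"]
```

For "thai chicken curry" tagged as ADJ/NOUN/NOUN this yields thai chicken, thai curry, and chicken curry, matching the expected output above.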

Find the most frequently occurring words in a text in R

允我心安 · submitted on 2019-11-30 12:58:34
Question: Can someone help me with how to find the most frequently used two- and three-word phrases in a text using R? My text is:

text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed
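The question asks for R, but the counting logic is the same everywhere: extract n-grams, then rank by frequency. A compact Python sketch (my illustration, not an answer from the thread):

```python
from collections import Counter
import re

def top_phrases(text, n, k=3):
    """Top-k most frequent n-word phrases in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = zip(*(words[i:] for i in range(n)))  # sliding window of size n
    return Counter(" ".join(g) for g in grams).most_common(k)

top_phrases("the cat sat on the mat the cat ran", 2, 1)
# → [("the cat", 2)]
```

Calling top_phrases(text, 2) and top_phrases(text, 3) gives the two- and three-word rankings the asker wants.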

How to find the most common bi-grams with BigQuery?

≡放荡痞女 · submitted on 2019-11-30 07:39:23
Question: I want to find the most common bi-grams (pairs of words) in my table. How can I do this with BigQuery?

Answer 1: BigQuery now supports SPLIT():

SELECT word, nextword, COUNT(*) c
FROM (
  SELECT pos, title, word,
         LEAD(word) OVER(PARTITION BY created_utc, title ORDER BY pos) nextword
  FROM (
    SELECT created_utc, title, word, pos
    FROM FLATTEN(
      (SELECT created_utc, title, word, POSITION(word) pos
       FROM (SELECT created_utc, title, SPLIT(title, ' ') word
             FROM [bigquery-samples:reddit.full])
      ), word)
  ))
WHERE

How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

久未见 · submitted on 2019-11-30 07:34:53
Question: I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example:

import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

4-grams: [u'like python it pretty', u'python it pretty awesome', u
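CountVectorizer's default token_pattern drops punctuation before n-grams are built, which is why the commas and apostrophes disappear above. The usual fix is to pass a pattern that also matches single punctuation characters, e.g. token_pattern=r"\w+|[^\w\s]". A dependency-free sketch of what that tokenization produces (plain re, not sklearn itself):

```python
import re

def tokens_with_punct(text):
    """Word tokens plus each punctuation mark as its own token,
    mirroring a CountVectorizer token_pattern of r"\w+|[^\w\s]"."""
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(tokens, n):
    """Space-joined word-level n-grams over a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toks = tokens_with_punct("I really like python, it's pretty awesome.")
# toks keeps "," "'" and "." as separate tokens
grams = ngrams(toks, 4)
```

With this pattern the 4-grams include items like "python , it '", i.e. punctuation now participates in the n-grams.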