n-gram

Converting a list of tokens to n-grams

夙愿已清 · submitted on 2019-12-01 11:10:41
I have a list of documents that have already been tokenized:

dat <- list(
  c("texaco", "canada", "lowered", "contract", "price", "pay", "crude", "oil",
    "canadian", "cts", "barrel", "effective", "decrease", "brings", "companys",
    "posted", "price", "benchmark", "grade", "edmonton", "swann", "hills",
    "light", "sweet", "canadian", "dlrs", "bbl", "texaco", "canada", "changed",
    "crude", "oil", "postings", "feb", "reuter"),
  c("argentine", "crude", "oil", "production", "pct", "january", "mln",
    "barrels", "mln", "barrels", "january", "yacimientos", "petroliferos",
    "fiscales", "january", "natural", "gas",
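The question's data is in R, but the transformation itself is language-agnostic: slide a window of size n over each token vector. A minimal sketch in Python (not from the original thread):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token list."""
    # zip over n staggered copies of the list; zip stops at the shortest
    return list(zip(*(tokens[i:] for i in range(n))))

doc = ["texaco", "canada", "lowered", "contract", "price"]
bigrams = ngrams(doc, 2)
# → [("texaco", "canada"), ("canada", "lowered"),
#    ("lowered", "contract"), ("contract", "price")]
```

Applying the same function per document with a list comprehension mirrors the R list-of-vectors structure.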

Quicker way to detect n-grams in a string?

梦想的初衷 · submitted on 2019-12-01 01:57:31
I found this solution on SO to detect n-grams in a string (here: N-gram generation from a sentence):

import java.util.*;

public class Test {
    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i + n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static
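The Java approach above is a straightforward sliding window over the split words. The same idea in Python (a sketch, not part of the original answer) is one comprehension, which avoids the explicit StringBuilder loop:

```python
def ngrams(n, text):
    """Word-level n-grams of a whitespace-split string, joined back with spaces."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

ngrams(2, "this is a test")
# → ["this is", "is a", "a test"]
```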

Bytes vs Characters vs Words - which granularity for n-grams?

北慕城南 · submitted on 2019-11-30 20:34:37
At least 3 types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs". Are there other criteria to consider for choosing the "right" representation? Evaluate. The criterion for choosing the representation is whatever works. Indeed, character level (!=
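The typo-robustness claim is easy to check concretely. A small sketch (my own illustration, not from the thread) comparing Jaccard similarity over character trigrams versus whole words:

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

clean = "mary loves dogs"
typo = "mary lpves dogs"

char_sim = jaccard(char_ngrams(clean), char_ngrams(typo))
word_sim = jaccard(set(clean.split()), set(typo.split()))
# A single-character typo breaks only the 3 trigrams that overlap it,
# so char_sim stays well above word_sim, where the whole word is lost.
```

Here char_sim is 0.625 (10 of 16 distinct trigrams survive) while word_sim drops to 0.5, matching the intuition in the question.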

Java Lucene NGramTokenizer

◇◆丶佛笑我妖孽 · submitted on 2019-11-30 17:39:59
I am trying to tokenize strings into ngrams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual ngrams that were tokenized. In fact I only see two methods in the NGramTokenizer class that return String objects. Here is the code that I have:

Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);

Where are the ngrams that were tokenized? How can I get the output in Strings/Words? I want my output to be like: This, is, a, test, string, This is, is a, a test, test string, This
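Worth noting: Lucene's NGramTokenizer emits character n-grams, while the output the asker wants is word n-grams (shingles), which is what Lucene's ShingleFilter produces. The desired output itself is easy to sketch outside Lucene; in Python (my illustration, not a Lucene API):

```python
def word_shingles(text, min_n=1, max_n=3):
    """All word n-grams of sizes min_n..max_n, grouped by size."""
    words = text.split()
    return [" ".join(words[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(words) - n + 1)]

word_shingles("This is a test string")
# Yields the 5 unigrams, then "This is", "is a", "a test",
# "test string", then the three trigrams.
```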

Python: populate a shelve object/dictionary with multiple keys

三世轮回 · submitted on 2019-11-30 16:25:56
Question: I have a list of 4-grams that I want to populate a dictionary/shelve object with:

['I', 'go', 'to', 'work']
['I', 'go', 'there', 'often']
['it', 'is', 'nice', 'being']
['I', 'live', 'in', 'NY']
['I', 'go', 'to', 'work']

So that we have something like four_grams['I']['go']['to']['work'] = 1, where any newly encountered 4-gram is populated with its four keys with the value 1, and the value is incremented if the same 4-gram is encountered again.

Answer 1: You could do something like this: import shelve from collections
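The answer is truncated above; one common way to get the nested-key behaviour (a sketch of the general idea, not necessarily what the original answer did) is a recursive defaultdict:

```python
from collections import defaultdict

def nested():
    # Each missing key auto-creates another nested dict level
    return defaultdict(nested)

counts = nested()
grams = [
    ["I", "go", "to", "work"],
    ["I", "go", "there", "often"],
    ["it", "is", "nice", "being"],
    ["I", "live", "in", "NY"],
    ["I", "go", "to", "work"],
]
for w1, w2, w3, w4 in grams:
    node = counts[w1][w2][w3]
    node[w4] = node.get(w4, 0) + 1  # 1 on first sight, incremented after
```

Note that shelve keys must be plain strings, so for an actual shelve you would flatten the four words into one key (e.g. " ".join(gram)) rather than nesting.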

How to generate bi/tri-grams using spacy/nltk

狂风中的少年 · submitted on 2019-11-30 13:26:53
Question: The input texts are always dish names with 1~3 adjectives and a noun.

Inputs:
thai iced tea
spicy fried chicken
sweet chili pork
thai chicken curry

Outputs:
thai tea, iced tea
spicy chicken, fried chicken
sweet pork, chili pork
thai chicken, chicken curry, thai curry

Basically, I am looking to parse the sentence tree and try to generate bi-grams by pairing an adjective with the noun. And I would like to achieve this with spacy or nltk.

Answer 1: I used spacy 2.0 with the English model.
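Once POS tags are available (spacy's token.pos_ or nltk.pos_tag would supply them in practice), the pairing step is simple. A pure-Python sketch of that step, with the tags assumed as input rather than computed:

```python
def dish_bigrams(tagged):
    """Pair each adjective with every later noun, plus adjacent noun-noun pairs.

    tagged: list of (word, tag) with tags "ADJ" or "NOUN" (assumed given).
    """
    out = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "ADJ":
            out += [f"{word} {w2}" for w2, t2 in tagged[i + 1:] if t2 == "NOUN"]
        elif tag == "NOUN" and i + 1 < len(tagged) and tagged[i + 1][1] == "NOUN":
            out.append(f"{word} {tagged[i + 1][0]}")
    return out

dish_bigrams([("thai", "ADJ"), ("iced", "ADJ"), ("tea", "NOUN")])
# → ["thai tea", "iced tea"]
```

For "thai chicken curry" tagged as ADJ/NOUN/NOUN this yields thai chicken, thai curry, and chicken curry, matching the expected output above.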

Find the most frequently occurring words in a text in R

允我心安 · submitted on 2019-11-30 12:58:34
Question: Can someone help me with how to find the most frequently used two- and three-word phrases in a text using R? My text is:

text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed
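The question asks for R, but the counting logic is the same everywhere: extract n-grams, then rank by frequency. A compact Python sketch (my illustration, not an answer from the thread):

```python
from collections import Counter
import re

def top_phrases(text, n, k=3):
    """Top-k most frequent n-word phrases in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = zip(*(words[i:] for i in range(n)))  # sliding window of size n
    return Counter(" ".join(g) for g in grams).most_common(k)

top_phrases("the cat sat on the mat the cat ran", 2, 1)
# → [("the cat", 2)]
```

Calling top_phrases(text, 2) and top_phrases(text, 3) gives the two- and three-word rankings the asker wants.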

How to find the most common bi-grams with BigQuery?

≡放荡痞女 · submitted on 2019-11-30 07:39:23
Question: I want to find the most common bi-grams (pairs of words) in my table. How can I do this with BigQuery?

Answer 1: BigQuery now supports SPLIT():

SELECT word, nextword, COUNT(*) c
FROM (
  SELECT pos, title, word,
         LEAD(word) OVER(PARTITION BY created_utc, title ORDER BY pos) nextword
  FROM (
    SELECT created_utc, title, word, pos
    FROM FLATTEN(
      (SELECT created_utc, title, word, POSITION(word) pos
       FROM (SELECT created_utc, title, SPLIT(title, ' ') word
             FROM [bigquery-samples:reddit.full])
      ), word)
  ))
WHERE

How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

久未见 · submitted on 2019-11-30 07:34:53
Question: I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example:

import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

4-grams: [u'like python it pretty', u'python it pretty awesome', u
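CountVectorizer's default token_pattern drops punctuation before n-grams are built, which is why the commas and apostrophes disappear above. The usual fix is to pass a pattern that also matches single punctuation characters, e.g. token_pattern=r"\w+|[^\w\s]". A dependency-free sketch of what that tokenization produces (plain re, not sklearn itself):

```python
import re

def tokens_with_punct(text):
    """Word tokens plus each punctuation mark as its own token,
    mirroring a CountVectorizer token_pattern of r"\w+|[^\w\s]"."""
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(tokens, n):
    """Space-joined word-level n-grams over a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toks = tokens_with_punct("I really like python, it's pretty awesome.")
# toks keeps "," "'" and "." as separate tokens
grams = ngrams(toks, 4)
```

With this pattern the 4-grams include items like "python , it '", i.e. punctuation now participates in the n-grams.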