n-gram

How to find the most common bi-grams with BigQuery?

只谈情不闲聊 submitted on 2019-11-29 05:13:08
I want to find the most common bi-grams (pairs of words) in my table. How can I do this with BigQuery?

BigQuery now supports SPLIT():

SELECT word, nextword, COUNT(*) c
FROM (
  SELECT pos, title, word,
         LEAD(word) OVER(PARTITION BY created_utc, title ORDER BY pos) nextword
  FROM (
    SELECT created_utc, title, word, pos
    FROM FLATTEN(
      (SELECT created_utc, title, word, POSITION(word) pos
       FROM (SELECT created_utc, title, SPLIT(title, ' ') word
             FROM [bigquery-samples:reddit.full])
      ), word)
  ))
WHERE nextword IS NOT null
GROUP EACH BY 1, 2
ORDER BY c DESC
LIMIT 100

Source: https://stackoverflow.com/questions

How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

时光总嘲笑我的痴心妄想 submitted on 2019-11-29 04:48:45
I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example:

import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

The punctuation is removed: how to include them as separate tokens? You should
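One common way to keep punctuation is to override CountVectorizer's token_pattern so that single punctuation characters count as tokens of their own. This is a minimal sketch, not necessarily the answer the excerpt above was leading into; the regex is an assumption about what "separate tokens" should mean here:

import sklearn.feature_extraction.text

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
# Match either a run of word characters or a single character that is neither
# a word character nor whitespace, so ',' and '.' survive as separate tokens.
vect = sklearn.feature_extraction.text.CountVectorizer(
    ngram_range=(ngram_size, ngram_size),
    token_pattern=r"\w+|[^\w\s]")
vect.fit(string)
# On newer scikit-learn, use get_feature_names_out() instead.
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))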

Is there an alternate for the now removed module 'nltk.model.NGramModel'?

▼魔方 西西 submitted on 2019-11-29 01:46:08
I've been trying to find an alternative for two straight days now and couldn't find anything relevant. I'm basically trying to get a probabilistic score for a synthesized sentence (synthesized by replacing some words in an original sentence picked from the corpora). I tried Collocations, but the scores I'm getting aren't very helpful. So I tried making use of the language-model concept, only to find that the seemingly helpful module 'model' has been removed from NLTK because of some bugs. It'd be really great if someone could either let me know about some alternate way to get the
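One possible replacement, assuming a reasonably recent NLTK (3.4 or later), is the nltk.lm package. The following is a minimal sketch with made-up training data, not a drop-in substitute for the removed nltk.model.NGramModel:

# Train a small bigram MLE language model and score words in context.
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

tokenized_corpus = [['this', 'is', 'a', 'sentence'],
                    ['this', 'is', 'another', 'sentence']]

n = 2
train_data, vocab = padded_everygram_pipeline(n, tokenized_corpus)
lm = MLE(n)
lm.fit(train_data, vocab)

# Probability of 'is' given the previous word 'this', and a log-probability.
print(lm.score('is', ['this']))
print(lm.logscore('sentence', ['another']))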

Python NLTK: Bigrams, trigrams, fourgrams

不打扰是莪最后的温柔 submitted on 2019-11-29 00:35:20
Question: I have this example and I want to know how to get this result. I have a text, I tokenize it, then I collect the bigrams, trigrams and fourgrams, like this:

import nltk
from nltk import word_tokenize
from nltk.util import ngrams

text = "Hi How are you? i am fine and you"
token = nltk.word_tokenize(text)
bigrams = ngrams(token, 2)

bigrams: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

trigrams = ngrams(token, 3)
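To collect bigrams, trigrams and fourgrams in one pass, nltk.util.everygrams is one option; a small sketch using the sentence from the question:

from nltk import word_tokenize
from nltk.util import everygrams

text = "Hi How are you? i am fine and you"
tokens = word_tokenize(text)
# All n-grams with n from 2 to 4, i.e. bigrams, trigrams and fourgrams.
grams = list(everygrams(tokens, min_len=2, max_len=4))
print(grams)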

n-grams with Naive Bayes classifier

試著忘記壹切 submitted on 2019-11-28 21:43:56
I'm new to Python and need help! I was practicing with Python NLTK text classification. Here is the code example I am practicing on: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/

I've tried this one:

from nltk import bigrams
from nltk.probability import ELEProbDist, FreqDist
from nltk import NaiveBayesClassifier
from collections import defaultdict

train_samples = {}

with file('positive.txt', 'rt') as f:
    for line in f.readlines():
        train_samples[line] = 'pos'

with file('negative.txt', 'rt') as d:
    for line in d.readlines():
        train_samples[line] = 'neg'

f = open("test
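A common pattern for feeding n-grams into NLTK's NaiveBayesClassifier is to turn each sample into a dict of bigram features. Below is a minimal sketch with made-up training data, not the positive.txt/negative.txt files from the question:

from nltk import bigrams, NaiveBayesClassifier

def bigram_features(sentence):
    # Map each bigram in the sentence to a boolean feature.
    tokens = sentence.lower().split()
    return {'bigram({} {})'.format(w1, w2): True for w1, w2 in bigrams(tokens)}

train = [(bigram_features('i love this movie'), 'pos'),
         (bigram_features('this movie is terrible'), 'neg')]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(bigram_features('i love it')))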

N-Gram

感情迁移 submitted on 2019-11-28 19:50:45
1. What is an N-Gram

An N-Gram is a statistical language model that predicts the n-th item from the preceding (n-1) items. At the application level these items can be characters (as in input-method engines) and so on. In general, an N-Gram model can be built from a large-scale text or audio corpus.

By convention, a 1-gram is called a unigram, a 2-gram a bigram, and a 3-gram a trigram; four-grams, five-grams and so on also exist, but applications with n > 5 are rare.

The idea behind N-Gram language models goes back to the work of Shannon, the master of information theory, who posed the question: given a string of letters, such as "for ex", what is the most likely next letter? From training data we can use maximum-likelihood estimation to obtain a probability distribution over the next letter: the probability of 'a' is 0.4, of 'b' is 0.0001, of 'c' is ..., subject of course to the constraint that all of these probabilities sum to 1.

Derivation of the N-Gram probability formula. By the definition of conditional probability and the multiplication rule, P(B|A) = P(A,B) / P(A). Suppose a sequence T consists of A1, A2, A3, ..., An; then the probability P(T) is:

P(A1 A2 A3 ... An) = P(A1) * P(A2|A1) * P(A3|A1,A2) * ... * P(An|A1,A2,...,An-1), where P(A1,A2,...,An-1) > 0.

Computing this directly is very hard, so the Markov assumption is introduced: the probability of an item depends only on the m items before it. With m = 0 this is the unigram model; with m = 1, the bigram model.
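To make the chain rule with the bigram (m = 1) Markov assumption concrete, here is a small illustrative sketch; the corpus and test sentence are made up:

# P(w1 ... wn) is approximated by P(w1) * product of P(wi | wi-1),
# with each conditional probability estimated by maximum likelihood from counts.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, word):
    # MLE estimate: count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

sentence = "the cat sat".split()
prob = unigram_counts[sentence[0]] / len(corpus)
for prev, word in zip(sentence, sentence[1:]):
    prob *= p_bigram(prev, word)
print(prob)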

Really fast word ngram vectorization in R

Deadly submitted on 2019-11-28 18:25:14
edit: The new package text2vec is excellent, and solves this problem (and many others) really well: text2vec on CRAN, text2vec on github, and a vignette that illustrates ngram tokenization.

I have a pretty large text dataset in R, which I've imported as a character vector:

#Takes about 15 seconds
system.time({
  set.seed(1)
  samplefun <- function(n, x, collapse){
    paste(sample(x, n, replace=TRUE), collapse=collapse)
  }
  words <- sapply(rpois(10000, 3) + 1, samplefun, letters, '')
  sents1 <- sapply(rpois(1000000, 5) + 1, samplefun, words, ' ')
})

I can convert this character data to a bag-of-words

Document-term matrix in R - bigram tokenizer not working

混江龙づ霸主 submitted on 2019-11-28 08:21:30
Question: I am trying to make two document-term matrices for a corpus, one with unigrams and one with bigrams. However, the bigram matrix is currently identical to the unigram matrix, and I'm not sure why. The code:

docs <- Corpus(DirSource("data", recursive=TRUE))

# Get the document term matrices
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize="words",
    removePunctuation = TRUE, stopwords = stopwords(

NLP series: Understanding subword embeddings (fastText) (with code)

耗尽温柔 submitted on 2019-11-28 07:07:50
1. What is fastText

English words usually have internal structure and formation rules. For example, from the surface forms of "dog", "dogs" and "dogcatcher" we can infer how they are related: these words all share the root "dog" but use different suffixes to change the meaning. Moreover, this kind of relationship generalizes to other words.

word2vec makes no direct use of this morphological information. In both the skip-gram model and the continuous bag-of-words (CBOW) model, words with different forms are represented by different vectors. For example, "dog" and "dogs" are represented by two separate vectors, and the model does not directly express the relationship between those two vectors. For this reason, fastText proposes subword embedding, which tries to bring word-formation information into the CBOW model of word2vec.

One point deserves special attention: in the usual case, training a fastText text classifier also produces word embeddings, i.e. the embeddings are a by-product of fastText classification. That is different from the case where you decide to train the fastText classifier with pre-trained embeddings.

2. Representing words with n-grams

word2vec treats every word in the corpus as atomic and generates one vector per word. This ignores the morphological features inside words. For example, "book" and "books", or "阿里巴巴" (Alibaba) and "阿里" (Ali): in both pairs the two words share many characters, i.e. their internal forms are similar, but in traditional word2vec
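As an illustration of the character n-gram idea behind fastText's subwords, here is a toy sketch; the boundary markers '<' and '>' and the 3-to-6 n-gram range follow the fastText paper, but this is not the library's own code:

def char_ngrams(word, n_min=3, n_max=6):
    # Wrap the word in boundary markers, then slice out every n-gram
    # with length between n_min and n_max.
    wrapped = '<' + word + '>'
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams('where', 3, 4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']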

How to compute skipgrams in python?

 ̄綄美尐妖づ submitted on 2019-11-28 06:24:51
A k-skipgram is an ngram which is a superset of all ngrams and each (k-i)-skipgram until (k-i) == 0 (which includes 0-skip-grams). So how can these skipgrams be computed efficiently in Python? Following is the code I tried, but it is not doing what I expected:

input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_skipgrams(input_list, N, K):
    bigram_list = []
    nlist = []
    K = 1
    for k in range(K+1):
        for i in range(len(input_list)-1):
            if i+k+1 < len(input_list):
                nlist = []
                for j in range(N+1):
                    if i+k+j+1 < len(input_list):
                        nlist.append(input_list[i+k+j+1])
                bigram_list.append(nlist)
    return bigram_list
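If NLTK is an option, nltk.util.skipgrams already implements this (it is available in reasonably recent NLTK releases); a minimal sketch on the same input:

from nltk.util import skipgrams

input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
# skipgrams(sequence, n, k) yields n-grams allowing up to k skipped tokens.
print(list(skipgrams(input_list, 2, 1)))   # 2-grams with up to 1 skip
print(list(skipgrams(input_list, 3, 2)))   # 3-grams with up to 2 skips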