n-gram

NGramTokenizer not working as expected

和自甴很熟 submitted on 2019-12-07 11:14:38
Question: I have some simple R code that reads text from a file and plots recurring phrases on a bar chart. For some reason, the bar chart only shows single words rather than multi-word phrases. Where am I going wrong?

install.packages("xlsx")
install.packages("tm")
install.packages("wordcloud")
install.packages("ggplot2")
library(xlsx)
library(tm)
library(wordcloud)
library(ggplot2)
setwd("C://Users//608447283//desktop//R_word_charts")
test <- Corpus(DirSource("C://Users//608447283//desktop//R

How to generate all n-grams in Hive

拜拜、爱过 submitted on 2019-12-06 15:19:53
I'd like to create a list of n-grams using HiveQL. My idea was to use a regex with a lookahead and the split function, but this does not work:

select split('This is my sentence', '(\\S+) +(?=(\\S+))');

The input is a column of the form

|sentence                 |
|-------------------------|
|This is my sentence      |
|This is another sentence |

The output is supposed to be:

["This is","is my","my sentence"]
["This is","is another","another sentence"]

There is an n-grams UDF in Hive, but that function directly calculates the frequency of the n-grams - I'd like to have a list of all the n-grams instead. Thanks a lot.
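For reference, here is a minimal Python sketch (not Hive) of the sliding-window transformation being asked for; in Hive itself this would usually need a custom UDF or an explode-based query, since split only returns the pieces between regex matches. The function name word_ngrams is illustrative only.

def word_ngrams(sentence, n=2):
    # Split on whitespace and slide a window of size n over the tokens.
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("This is my sentence"))       # ['This is', 'is my', 'my sentence']
print(word_ngrams("This is another sentence"))  # ['This is', 'is another', 'another sentence']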

The N-gram Model

淺唱寂寞╮ submitted on 2019-12-06 14:42:39
The N-gram Model

(1) Introduction

The N-gram model is a common statistics-based language model in natural language processing. Its basic idea is to slide a window of size N over the content of a text, producing a sequence of fragments of length N; each fragment is called a gram. We count how often every gram occurs in the given sentence, and by comparing those counts against the frequency of each gram in the overall corpus we obtain the probability of each gram appearing in the sentence. N-grams perform notably well at tasks such as judging whether a sentence is well-formed, comparing sentence similarity, and word segmentation.

(2) Naive Bayes

First, let us review a very basic model: Naive Bayes. Its key ingredients are Bayes' rule and the conditional-independence assumption (see https://www.yuque.com/dadahuang/tvnnrr/gksobm for a reference). To make this concrete, consider spam classification. Suppose your mailbox receives a spam message whose content includes "性感荷官在线发牌..." ("sexy dealer dealing cards online..."). The goal of Naive Bayes is to compute the probability that this sentence is spam. From Bayes' rule we get

P(spam | "性感荷官在线发牌") ∝ P(spam) · P("性感荷官在线发牌" | spam)

and by the conditional-independence assumption

P("性感荷官在线发牌" | spam) = P("性", "感", "荷", "官", "在", "线", "发", "牌" | spam)

which then factorizes into a product of per-character probabilities.
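To make the sliding-window description in the introduction concrete, here is a minimal Python sketch, assuming character-level grams over a plain string; the function name count_grams is illustrative only.

from collections import Counter

def count_grams(text, n=2):
    # Slide a window of size n over the text; each length-n fragment is one gram.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    # Count how often each gram occurs in the given text.
    return Counter(grams)

print(count_grams("性感荷官在线发牌", n=2).most_common(3))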

Multi-word Term Vectors with Word nGrams?

筅森魡賤 submitted on 2019-12-06 08:27:26
Question: I'm aiming to build an index that, for each document, will break it down by word ngrams (uni, bi, and tri), then capture term vector analysis on all of those word ngrams. Is that possible with Elasticsearch? For instance, for a document field containing "The red car drives." I would be able to get the information:

red - 1 instance
car - 1 instance
drives - 1 instance
red car - 1 instance
car drives - 1 instance
red car drives - 1 instance

Thanks in advance!

Answer 1: Assuming you already know
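Word-level n-grams in Elasticsearch are usually produced with a shingle token filter, and term vectors can then be stored on the analyzed field. The sketch below, using the elasticsearch Python client, is only an illustration of that idea: the index name, field name and analyzer name are made up, and the exact client calls can differ between client and server versions.

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical index: the shingle filter emits uni-, bi- and tri-grams of words,
# and term vectors are stored so they can be inspected per document.
es.indices.create(index="ngram_demo", body={
    "settings": {
        "analysis": {
            "filter": {
                "word_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "output_unigrams": True
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "word_shingles"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "shingle_analyzer",
                "term_vector": "yes"
            }
        }
    }
})

es.index(index="ngram_demo", id=1, body={"text": "The red car drives."}, refresh=True)

# The stored term vectors list every analyzed term, including the shingles.
tv = es.termvectors(index="ngram_demo", id=1, fields=["text"])
print(sorted(tv["term_vectors"]["text"]["terms"].keys()))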

n-grams from text in python

余生颓废 submitted on 2019-12-06 07:24:20
An update to my previous post, with some changes: Say that I have 100 tweets. In those tweets, I need to extract: 1) food names, and 2) beverage names. I also need to attach a type (drink or food) and an id number (each item has a unique id) to each extraction. I already have a lexicon with names, types and id numbers:

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
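One way to approach this, sketched below, is to generate word n-grams from each tweet up to the length of the longest lexicon key and try the longest matches first, so that 'banana split' wins over 'banana'. It assumes the lexicon dictionary from the question; the example tweet and the output format are made up for illustration.

def extract_items(tweet, lexicon):
    tokens = tweet.lower().split()
    # Try the longest keys first so multi-word items beat their sub-words.
    max_len = max(len(key.split()) for key in lexicon)
    found, used = [], set()
    for n in range(max_len, 0, -1):
        for i in range(len(tokens) - n + 1):
            if any(j in used for j in range(i, i + n)):
                continue  # already covered by a longer match
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                found.append({"name": phrase, **lexicon[phrase]})
                used.update(range(i, i + n))
    return found

print(extract_items("I had a banana split and a cola", lexicon))
# [{'name': 'banana split', 'type': 'food', 'id': 'f_567'}, {'name': 'cola', 'type': 'drink', 'id': 'd_345'}]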

Is there a bi gram or tri gram feature in Spacy?

扶醉桌前 submitted on 2019-12-06 06:11:38
Question: The code below breaks the sentence into individual tokens, and the output is: "cloud" "computing" "is" "benefiting" "major" "manufacturing" "companies"

import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)

What I would ideally want is to read 'cloud computing' together, as it is technically one word. Basically I am looking for a bi gram. Is there any feature in Spacy that allows Bi
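To my knowledge, spaCy does not ship a dedicated n-gram helper in its core API, but adjacent-token spans can be sliced directly from the Doc, and doc.noun_chunks will often surface multi-word terms such as "cloud computing". A minimal sketch, assuming the en_core_web_sm model is installed:

import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")

# Adjacent token pairs, taken as Span slices of the Doc.
bigrams = [doc[i:i + 2].text for i in range(len(doc) - 1)]
print(bigrams)  # ['Cloud computing', 'computing is', 'is benefiting', ...]

# Noun chunks often capture multi-word terms like 'cloud computing'.
print([chunk.text for chunk in doc.noun_chunks])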

Rails sunspot-solr - words with hyphen

谁说胖子不能爱 submitted on 2019-12-06 04:51:11
I'm using the sunspot_rails gem and everything has been working perfectly so far, but I'm not getting any search results for words with a hyphen. Example: the string "tron" returns a lot of results (the word mentioned in all articles is e-tron), while the string "e-tron" returns 0 results even though this is the correct word mentioned in all my articles. My current schema.xml config:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/

2-gram and 3-gram instead of 1-gram using RWeka

梦想的初衷 submitted on 2019-12-05 18:58:48
I am trying to extract 1-grams, 2-grams and 3-grams from the training corpus, using the RWeka NGramTokenizer function. Unfortunately, I am getting only 1-grams. This is my code:

train_corpus

# clean-up
cleanset1 <- tm_map(train_corpus, tolower)
cleanset2 <- tm_map(cleanset1, removeNumbers)
cleanset3 <- tm_map(cleanset2, removeWords, stopwords("english"))
cleanset4 <- tm_map(cleanset3, removePunctuation)
cleanset5 <- tm_map(cleanset4, stemDocument, language="english")
cleanset6 <- tm_map(cleanset5, stripWhitespace)

# 1-gram
NgramTokenizer1 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
train_dtm

Is there a more efficient way to find most common n-grams?

ⅰ亾dé卋堺 submitted on 2019-12-05 18:43:46
I'm trying to find k most common n-grams from a large corpus. I've seen lots of places suggesting the naïve approach - simply scanning through the entire corpus and keeping a dictionary of the count of all n-grams. Is there a better way to do this?

alvas: In Python, using NLTK:

$ wget http://norvig.com/big.txt
$ python
>>> from collections import Counter
>>> from nltk import ngrams
>>> bigtxt = open('big.txt').read()
>>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065)
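If the corpus is too large to hold in memory at once, one common refinement of the same idea is to stream the file and update the counter incrementally; a minimal sketch, assuming a plain-text file read line by line (the nltk.ngrams call is the same as in the answer above):

from collections import Counter
from nltk import ngrams

def most_common_ngrams(path, n=2, k=10):
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            # Count n-grams per line instead of loading the whole file at once.
            counts.update(ngrams(line.split(), n))
    return counts.most_common(k)

print(most_common_ngrams('big.txt', n=2, k=10))

Note that counting per line drops the few n-grams that span a line break, which is usually acceptable for frequency estimates.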

NGramTokenizer not working as expected

﹥>﹥吖頭↗ submitted on 2019-12-05 16:47:25
I have some simple R code that reads text from a file and plots recurring phrases on a bar chart. For some reason, the bar chart only shows single words rather than multi-word phrases. Where am I going wrong?

install.packages("xlsx")
install.packages("tm")
install.packages("wordcloud")
install.packages("ggplot2")
library(xlsx)
library(tm)
library(wordcloud)
library(ggplot2)
setwd("C://Users//608447283//desktop//R_word_charts")
test <- Corpus(DirSource("C://Users//608447283//desktop//R_word_charts//source"))
test <- tm_map(test, stripWhitespace)
test <- tm_map(test, tolower)
test <- tm