n-gram

NGramTokenizer not working as expected

和自甴很熟 submitted on 2019-12-07 11:14:38
Question: I have some simple R code that reads text from a file and plots recurring phrases on a bar chart. For some reason, the bar chart only shows single words rather than multi-word phrases. Where am I going wrong?

install.packages("xlsx")
install.packages("tm")
install.packages("wordcloud")
install.packages("ggplot2")
library(xlsx)
library(tm)
library(wordcloud)
library(ggplot2)
setwd("C://Users//608447283//desktop//R_word_charts")
test <- Corpus(DirSource("C://Users//608447283//desktop//R

How to generate all n-grams in Hive

拜拜、爱过 submitted on 2019-12-06 15:19:53
I'd like to create a list of n-grams using HiveQL. My idea was to use a regex with a lookahead and the split function, but this does not work:

select split('This is my sentence', '(\\S+) +(?=(\\S+))');

The input is a column of the form

|sentence                 |
|-------------------------|
|This is my sentence      |
|This is another sentence |

The output is supposed to be:

["This is","is my","my sentence"]
["This is","is another","another sentence"]

There is an n-grams UDF in Hive, but that function directly calculates the frequency of the n-grams - I'd like to have a list of all the n-grams instead. Thanks a lot.
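For reference, here is a minimal Python sketch (not Hive) of the sliding-window transformation being asked for; in Hive itself this would usually need a custom UDF or an explode-based query, since split only returns the pieces between regex matches. The function name word_ngrams is illustrative only.

def word_ngrams(sentence, n=2):
    # Split on whitespace and slide a window of size n over the tokens.
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("This is my sentence"))       # ['This is', 'is my', 'my sentence']
print(word_ngrams("This is another sentence"))  # ['This is', 'is another', 'another sentence']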

The N-gram Model

淺唱寂寞╮ submitted on 2019-12-06 14:42:39
The N-gram Model

(1) Introduction

The N-gram model is a common statistics-based language model in natural language processing. Its basic idea is to slide a window of size N over the content of a text, producing a sequence of fragments of length N; each fragment is called a gram. We count how often every gram occurs in the given sentence, and by comparing those counts against the frequency of each gram in the overall corpus we obtain the probability of each gram appearing in the sentence. N-grams perform notably well at tasks such as judging whether a sentence is well-formed, comparing sentence similarity, and word segmentation.

(2) Naive Bayes

First, let us review a very basic model: Naive Bayes. Its key ingredients are Bayes' rule and the conditional-independence assumption (see https://www.yuque.com/dadahuang/tvnnrr/gksobm for a reference). To make this concrete, consider spam classification. Suppose your mailbox receives a spam message whose content includes "性感荷官在线发牌..." ("sexy dealer dealing cards online..."). The goal of Naive Bayes is to compute the probability that this sentence is spam. From Bayes' rule we get

P(spam | "性感荷官在线发牌") ∝ P(spam) · P("性感荷官在线发牌" | spam)

and by the conditional-independence assumption

P("性感荷官在线发牌" | spam) = P("性", "感", "荷", "官", "在", "线", "发", "牌" | spam)

which then factorizes into a product of per-character probabilities.
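To make the sliding-window description in the introduction concrete, here is a minimal Python sketch, assuming character-level grams over a plain string; the function name count_grams is illustrative only.

from collections import Counter

def count_grams(text, n=2):
    # Slide a window of size n over the text; each length-n fragment is one gram.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    # Count how often each gram occurs in the given text.
    return Counter(grams)

print(count_grams("性感荷官在线发牌", n=2).most_common(3))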

Multi-word Term Vectors with Word nGrams?

筅森魡賤 submitted on 2019-12-06 08:27:26
Question: I'm aiming to build an index that, for each document, will break it down by word ngrams (uni, bi, and tri), then capture term vector analysis on all of those word ngrams. Is that possible with Elasticsearch? For instance, for a document field containing "The red car drives." I would be able to get the information:

red - 1 instance
car - 1 instance
drives - 1 instance
red car - 1 instance
car drives - 1 instance
red car drives - 1 instance

Thanks in advance!

Answer 1: Assuming you already know
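Word-level n-grams in Elasticsearch are usually produced with a shingle token filter, and term vectors can then be stored on the analyzed field. The sketch below, using the elasticsearch Python client, is only an illustration of that idea: the index name, field name and analyzer name are made up, and the exact client calls can differ between client and server versions.

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical index: the shingle filter emits uni-, bi- and tri-grams of words,
# and term vectors are stored so they can be inspected per document.
es.indices.create(index="ngram_demo", body={
    "settings": {
        "analysis": {
            "filter": {
                "word_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "output_unigrams": True
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "word_shingles"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "shingle_analyzer",
                "term_vector": "yes"
            }
        }
    }
})

es.index(index="ngram_demo", id=1, body={"text": "The red car drives."}, refresh=True)

# The stored term vectors list every analyzed term, including the shingles.
tv = es.termvectors(index="ngram_demo", id=1, fields=["text"])
print(sorted(tv["term_vectors"]["text"]["terms"].keys()))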

n-grams from text in python

余生颓废 submitted on 2019-12-06 07:24:20
An update to my previous post, with some changes: Say that I have 100 tweets. In those tweets, I need to extract: 1) food names, and 2) beverage names. I also need to attach a type (drink or food) and an id number (each item has a unique id) to each extraction. I already have a lexicon with names, types and id numbers:

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
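One way to approach this, sketched below, is to generate word n-grams from each tweet up to the length of the longest lexicon key and try the longest matches first, so that 'banana split' wins over 'banana'. It assumes the lexicon dictionary from the question; the example tweet and the output format are made up for illustration.

def extract_items(tweet, lexicon):
    tokens = tweet.lower().split()
    # Try the longest keys first so multi-word items beat their sub-words.
    max_len = max(len(key.split()) for key in lexicon)
    found, used = [], set()
    for n in range(max_len, 0, -1):
        for i in range(len(tokens) - n + 1):
            if any(j in used for j in range(i, i + n)):
                continue  # already covered by a longer match
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                found.append({"name": phrase, **lexicon[phrase]})
                used.update(range(i, i + n))
    return found

print(extract_items("I had a banana split and a cola", lexicon))
# [{'name': 'banana split', 'type': 'food', 'id': 'f_567'}, {'name': 'cola', 'type': 'drink', 'id': 'd_345'}]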

Is there a bi gram or tri gram feature in Spacy?

扶醉桌前 submitted on 2019-12-06 06:11:38
Question: The code below breaks the sentence into individual tokens, and the output is: "cloud" "computing" "is" "benefiting" "major" "manufacturing" "companies"

import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)

What I would ideally want is to read 'cloud computing' together, as it is technically one word. Basically I am looking for a bi gram. Is there any feature in Spacy that allows Bi
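To my knowledge, spaCy does not ship a dedicated n-gram helper in its core API, but adjacent-token spans can be sliced directly from the Doc, and doc.noun_chunks will often surface multi-word terms such as "cloud computing". A minimal sketch, assuming the en_core_web_sm model is installed:

import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")

# Adjacent token pairs, taken as Span slices of the Doc.
bigrams = [doc[i:i + 2].text for i in range(len(doc) - 1)]
print(bigrams)  # ['Cloud computing', 'computing is', 'is benefiting', ...]

# Noun chunks often capture multi-word terms like 'cloud computing'.
print([chunk.text for chunk in doc.noun_chunks])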

Rails sunspot-solr - words with hyphen

谁说胖子不能爱 submitted on 2019-12-06 04:51:11
I'm using the sunspot_rails gem and everything has been working perfectly so far, but I'm not getting any search results for words with a hyphen. Example: the string "tron" returns a lot of results (the word mentioned in all articles is e-tron), while the string "e-tron" returns 0 results even though this is the correct word mentioned in all my articles. My current schema.xml config:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/

2-gram and 3-gram instead of 1-gram using RWeka

梦想的初衷 submitted on 2019-12-05 18:58:48
I am trying to extract 1-grams, 2-grams and 3-grams from the training corpus, using the RWeka NGramTokenizer function. Unfortunately, I am getting only 1-grams. This is my code:

train_corpus

# clean-up
cleanset1 <- tm_map(train_corpus, tolower)
cleanset2 <- tm_map(cleanset1, removeNumbers)
cleanset3 <- tm_map(cleanset2, removeWords, stopwords("english"))
cleanset4 <- tm_map(cleanset3, removePunctuation)
cleanset5 <- tm_map(cleanset4, stemDocument, language="english")
cleanset6 <- tm_map(cleanset5, stripWhitespace)

# 1-gram
NgramTokenizer1 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
train_dtm

Is there a more efficient way to find most common n-grams?

ⅰ亾dé卋堺 submitted on 2019-12-05 18:43:46
I'm trying to find k most common n-grams from a large corpus. I've seen lots of places suggesting the naïve approach - simply scanning through the entire corpus and keeping a dictionary of the count of all n-grams. Is there a better way to do this?

alvas: In Python, using NLTK:

$ wget http://norvig.com/big.txt
$ python
>>> from collections import Counter
>>> from nltk import ngrams
>>> bigtxt = open('big.txt').read()
>>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065)
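If the corpus is too large to hold in memory at once, one common refinement of the same idea is to stream the file and update the counter incrementally; a minimal sketch, assuming a plain-text file read line by line (the nltk.ngrams call is the same as in the answer above):

from collections import Counter
from nltk import ngrams

def most_common_ngrams(path, n=2, k=10):
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            # Count n-grams per line instead of loading the whole file at once.
            counts.update(ngrams(line.split(), n))
    return counts.most_common(k)

print(most_common_ngrams('big.txt', n=2, k=10))

Note that counting per line drops the few n-grams that span a line break, which is usually acceptable for frequency estimates.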

NGramTokenizer not working as expected

﹥>﹥吖頭↗ submitted on 2019-12-05 16:47:25
I have some simple R code that reads text from a file and plots recurring phrases on a bar chart. For some reason, the bar chart only shows single words rather than multi-word phrases. Where am I going wrong?

install.packages("xlsx")
install.packages("tm")
install.packages("wordcloud")
install.packages("ggplot2")
library(xlsx)
library(tm)
library(wordcloud)
library(ggplot2)
setwd("C://Users//608447283//desktop//R_word_charts")
test <- Corpus(DirSource("C://Users//608447283//desktop//R_word_charts//source"))
test <- tm_map(test, stripWhitespace)
test <- tm_map(test, tolower)
test <- tm