n-gram

Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

允我心安 submitted on 2019-11-30 07:31:15
Following the many guides to creating bigrams using the 'tm' and 'RWeka' packages, I was getting frustrated that only 1-grams were being returned in the tdm. Through much trial and error I discovered that proper function was achieved using 'VCorpus' but not 'Corpus'. BTW, I'm pretty sure this was working with 'Corpus' ~1 month ago, but it is not now. R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest versions. I would appreciate any insight into why this won't work with Corpus and whether others have the same problem.

Finding ngrams in R and comparing ngrams across corpora

有些话、适合烂在心里 submitted on 2019-11-30 06:56:52
I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple words, e.g. "struggle-criticism-transformation movement"). This is a two-step question, one regarding my code so far and one regarding how I should go on. Step 1: To do this, I wanted to identify some common ngrams first. But I get stuck very early on. Here is what I've been doing:

    library(tm)
    library(RWeka)
    a <- Corpus(DirSource("/mycorpora/1965"),
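
The tm-specific part aside, the underlying task, counting n-grams per corpus and then comparing counts across corpora, can be sketched independently of R. Below is a minimal Python/NLTK illustration; the reference directory, the frequency threshold, and the assumption that the directories contain plain-text files are all hypothetical, chosen just to show the comparison step:

    import os
    from collections import Counter
    from nltk import word_tokenize
    from nltk.util import ngrams

    def corpus_ngram_counts(directory, n):
        """Count word n-grams over every plain-text file in a directory."""
        counts = Counter()
        for name in os.listdir(directory):
            with open(os.path.join(directory, name), encoding="utf-8") as f:
                tokens = word_tokenize(f.read().lower())
            counts.update(ngrams(tokens, n))
        return counts

    # n-grams frequent in the 1965 corpus but absent from a reference corpus
    # are candidates for newly coined terms (paths follow the question's layout;
    # the reference directory is hypothetical).
    c1965 = corpus_ngram_counts("/mycorpora/1965", 3)
    cref = corpus_ngram_counts("/mycorpora/reference", 3)
    novel = {g: c for g, c in c1965.items() if c >= 5 and cref.get(g, 0) == 0}
    print(sorted(novel.items(), key=lambda kv: -kv[1])[:20])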

Find the most frequently occurring words in a text in R

别等时光非礼了梦想. submitted on 2019-11-30 04:08:01
Can someone help me with how to find the most frequently occurring two- and three-word sequences in a text using R? My text is:

    text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase
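
The underlying computation is language-independent: slide an n-word window over the token list and count the resulting tuples. A minimal Python sketch (the regex tokenizer is a deliberate simplification, and the text is shortened here):

    from collections import Counter
    import re

    text = "There is a difference between the common use of the term phrase ..."
    tokens = re.findall(r"[a-z']+", text.lower())   # crude word tokenizer

    def top_ngrams(tokens, n, k=5):
        """Return the k most common n-word sequences."""
        return Counter(zip(*(tokens[i:] for i in range(n)))).most_common(k)

    print(top_ngrams(tokens, 2))   # most frequent word pairs
    print(top_ngrams(tokens, 3))   # most frequent word triples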

Bytes vs Characters vs Words - which granularity for n-grams?

[亡魂溺海] submitted on 2019-11-30 03:55:01
Question: At least 3 types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs". Are there other criteria to consider for choosing the "right" representation?
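
The typo argument is easy to verify empirically: compare the overlap of word-level and character-level n-gram sets for the two example sentences. A small Python sketch using Jaccard similarity, one of several reasonable overlap measures:

    def char_ngrams(s, n):
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    s1, s2 = "mary loves dogs", "mary lpves dogs"   # one-character typo

    # Word level: 'loves' != 'lpves', so only 2 of the 4 distinct words match.
    print(jaccard(set(s1.split()), set(s2.split())))         # 0.5

    # Character trigrams: most windows never touch the typo.
    print(jaccard(char_ngrams(s1, 3), char_ngrams(s2, 3)))   # 10/16 = 0.625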

Python NLTK: Bigrams, trigrams, fourgrams

旧城冷巷雨未停 submitted on 2019-11-30 03:25:30
I have this example and I want to know how to get this result. I have some text, I tokenize it, and then I collect bigrams, trigrams, and fourgrams, like this:

    import nltk
    from nltk import word_tokenize
    from nltk.util import ngrams
    text = "Hi How are you? i am fine and you"
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)

bigrams: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

    trigrams = ngrams(token, 3)

trigrams: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i',
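
A self-contained version of the same steps, extended to fourgrams. Note that in NLTK 3, ngrams() returns a generator, so it has to be wrapped in list() before it can be printed (a minimal sketch):

    import nltk
    from nltk.util import ngrams

    text = "Hi How are you? i am fine and you"
    token = nltk.word_tokenize(text)

    bigrams = list(ngrams(token, 2))     # pairs of adjacent tokens
    trigrams = list(ngrams(token, 3))    # triples
    fourgrams = list(ngrams(token, 4))   # quadruples

    print(fourgrams[:3])
    # [('Hi', 'How', 'are', 'you'), ('How', 'are', 'you', '?'),
    #  ('are', 'you', '?', 'i')]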

counting n-gram frequency in python nltk

孤街浪徒 submitted on 2019-11-29 23:01:55
I have the following code. I know that I can use the apply_freq_filter function to filter out collocations below a frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (in my case bigrams) in a document before I decide what frequency to set for filtering. As you can see, I am using the nltk collocations class.

    import nltk
    from nltk.collocations import *
    line = ""
    open_file = open('a_text_file','r')
    for val in open_file:
        line += val
    tokens = line.split()
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from
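
BigramCollocationFinder keeps the raw counts in its ngram_fd attribute (an nltk FreqDist mapping each bigram tuple to its count), so the distribution can be inspected before choosing a threshold. A minimal sketch along the lines of the code above (the cutoff value is just an example):

    import nltk
    from nltk.collocations import BigramCollocationFinder

    with open('a_text_file', 'r') as f:   # filename from the question
        tokens = f.read().split()

    finder = BigramCollocationFinder.from_words(tokens)

    # Look at the most frequent bigrams and their counts first ...
    for bigram, count in finder.ngram_fd.most_common(20):
        print(bigram, count)

    # ... then pick an informed cutoff.
    finder.apply_freq_filter(3)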

NLTK package to estimate the (unigram) perplexity

为君一笑 submitted on 2019-11-29 14:32:12
Question: I am trying to calculate the perplexity for the data I have. The code I am using is:

    import sys
    sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
    from nltk.corpus import brown
    from nltk.model import NgramModel
    from nltk.probability import LidstoneProbDist, WittenBellProbDist
    estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
    lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
    print lm

But I am receiving the error, File "/usr/local
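
The error is unsurprising on a current installation: NgramModel and the whole nltk.model module were removed from NLTK. The modern replacement is the nltk.lm submodule. Below is a rough equivalent of the Lidstone-smoothed trigram model above, sketched against that API; the test sentence is arbitrary:

    from nltk.corpus import brown
    from nltk.lm import Lidstone
    from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
    from nltk.util import ngrams

    order = 3
    train, vocab = padded_everygram_pipeline(order, brown.sents(categories='news'))

    lm = Lidstone(0.2, order)   # gamma = 0.2, matching the original estimator
    lm.fit(train, vocab)

    # Perplexity is computed over a sequence of padded n-grams.
    test_sent = ['the', 'jury', 'said', 'it', 'did']   # arbitrary test text
    test = list(ngrams(pad_both_ends(test_sent, n=order), order))
    print(lm.perplexity(test))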

How to implement a spectrum kernel function in MATLAB?

会有一股神秘感。 submitted on 2019-11-29 11:36:59
A spectrum kernel function operates on strings by counting the n-grams two strings have in common. For example, 'tool' has three 2-grams ('to', 'oo', and 'ol'), and the similarity between 'tool' and 'fool' is 2 ('oo' and 'ol' in common). How can I write a MATLAB function that calculates this metric? The first step would be to create a function that can generate the n-grams for a given string. One way to do this in a vectorized fashion is with some clever indexing:

    function [subStrings, counts] = n_gram(fullString, N)
      if (N == 1)
        [subStrings, ~, index] = unique(cellstr(fullString.')); %.'#
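
For comparison, the same kernel is compact in Python: build the n-gram count vector of each string and take the dot product of the two vectors (for 'tool' and 'fool' with n = 2 this yields 2, matching the example above). A sketch:

    from collections import Counter

    def spectrum_kernel(s, t, n):
        """Dot product of the n-gram count vectors of two strings."""
        cs = Counter(s[i:i + n] for i in range(len(s) - n + 1))
        ct = Counter(t[i:i + n] for i in range(len(t) - n + 1))
        return sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())

    print(spectrum_kernel('tool', 'fool', 2))   # 2: 'oo' and 'ol' in common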
