n-gram

Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

允我心安 submitted on 2019-11-30 07:31:15
Following the many guides to creating bigrams using the 'tm' and 'RWeka' packages, I was getting frustrated that only 1-grams were being returned in the tdm. Through much trial and error I discovered that proper function was achieved using 'VCorpus' but not 'Corpus'. BTW, I'm pretty sure this was working with 'Corpus' ~1 month ago, but it is not now. R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest versions. I would appreciate any insight into why this won't work with Corpus and whether others have the same problem.

Finding ngrams in R and comparing ngrams across corpora

有些话、适合烂在心里 submitted on 2019-11-30 06:56:52
I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple words, e.g. "struggle-criticism-transformation movement"). This is a two-step question, one regarding my code so far and one regarding how I should go on. Step 1: To do this, I wanted to identify some common ngrams first. But I get stuck very early on. Here is what I've been doing:

    library(tm)
    library(RWeka)
    a <- Corpus(DirSource("/mycorpora/1965"),
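
The tm-specific part aside, the underlying task, counting n-grams per corpus and then comparing counts across corpora, can be sketched independently of R. Below is a minimal Python/NLTK illustration; the reference directory, the frequency threshold, and the assumption that the directories contain plain-text files are all hypothetical, chosen just to show the comparison step:

    import os
    from collections import Counter
    from nltk import word_tokenize
    from nltk.util import ngrams

    def corpus_ngram_counts(directory, n):
        """Count word n-grams over every plain-text file in a directory."""
        counts = Counter()
        for name in os.listdir(directory):
            with open(os.path.join(directory, name), encoding="utf-8") as f:
                tokens = word_tokenize(f.read().lower())
            counts.update(ngrams(tokens, n))
        return counts

    # n-grams frequent in the 1965 corpus but absent from a reference corpus
    # are candidates for newly coined terms (paths follow the question's layout;
    # the reference directory is hypothetical).
    c1965 = corpus_ngram_counts("/mycorpora/1965", 3)
    cref = corpus_ngram_counts("/mycorpora/reference", 3)
    novel = {g: c for g, c in c1965.items() if c >= 5 and cref.get(g, 0) == 0}
    print(sorted(novel.items(), key=lambda kv: -kv[1])[:20])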

Find the most frequently occurring words in a text in R

别等时光非礼了梦想. submitted on 2019-11-30 04:08:01
Can someone help me with how to find the most frequently occurring two- and three-word sequences in a text using R? My text is:

    text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase
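
The underlying computation is language-independent: slide an n-word window over the token list and count the resulting tuples. A minimal Python sketch (the regex tokenizer is a deliberate simplification, and the text is shortened here):

    from collections import Counter
    import re

    text = "There is a difference between the common use of the term phrase ..."
    tokens = re.findall(r"[a-z']+", text.lower())   # crude word tokenizer

    def top_ngrams(tokens, n, k=5):
        """Return the k most common n-word sequences."""
        return Counter(zip(*(tokens[i:] for i in range(n)))).most_common(k)

    print(top_ngrams(tokens, 2))   # most frequent word pairs
    print(top_ngrams(tokens, 3))   # most frequent word triples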

Bytes vs Characters vs Words - which granularity for n-grams?

[亡魂溺海] submitted on 2019-11-30 03:55:01
Question: At least 3 types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs". Are there other criteria to consider for choosing the "right" representation?
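
The typo argument is easy to verify empirically: compare the overlap of word-level and character-level n-gram sets for the two example sentences. A small Python sketch using Jaccard similarity, one of several reasonable overlap measures:

    def char_ngrams(s, n):
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    s1, s2 = "mary loves dogs", "mary lpves dogs"   # one-character typo

    # Word level: 'loves' != 'lpves', so only 2 of the 4 distinct words match.
    print(jaccard(set(s1.split()), set(s2.split())))         # 0.5

    # Character trigrams: most windows never touch the typo.
    print(jaccard(char_ngrams(s1, 3), char_ngrams(s2, 3)))   # 10/16 = 0.625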

Python NLTK: Bigrams, trigrams, fourgrams

旧城冷巷雨未停 submitted on 2019-11-30 03:25:30
I have this example and I want to know how to get this result. I have some text, I tokenize it, and then I collect bigrams, trigrams, and fourgrams, like this:

    import nltk
    from nltk import word_tokenize
    from nltk.util import ngrams
    text = "Hi How are you? i am fine and you"
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)

bigrams: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

    trigrams = ngrams(token, 3)

trigrams: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i',
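
A self-contained version of the same steps, extended to fourgrams. Note that in NLTK 3, ngrams() returns a generator, so it has to be wrapped in list() before it can be printed (a minimal sketch):

    import nltk
    from nltk.util import ngrams

    text = "Hi How are you? i am fine and you"
    token = nltk.word_tokenize(text)

    bigrams = list(ngrams(token, 2))     # pairs of adjacent tokens
    trigrams = list(ngrams(token, 3))    # triples
    fourgrams = list(ngrams(token, 4))   # quadruples

    print(fourgrams[:3])
    # [('Hi', 'How', 'are', 'you'), ('How', 'are', 'you', '?'),
    #  ('are', 'you', '?', 'i')]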

counting n-gram frequency in python nltk

孤街浪徒 submitted on 2019-11-29 23:01:55
I have the following code. I know that I can use the apply_freq_filter function to filter out collocations below a frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (in my case bigrams) in a document before I decide what frequency to set for filtering. As you can see, I am using the nltk collocations class.

    import nltk
    from nltk.collocations import *
    line = ""
    open_file = open('a_text_file','r')
    for val in open_file:
        line += val
    tokens = line.split()
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from
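
BigramCollocationFinder keeps the raw counts in its ngram_fd attribute (an nltk FreqDist mapping each bigram tuple to its count), so the distribution can be inspected before choosing a threshold. A minimal sketch along the lines of the code above (the cutoff value is just an example):

    import nltk
    from nltk.collocations import BigramCollocationFinder

    with open('a_text_file', 'r') as f:   # filename from the question
        tokens = f.read().split()

    finder = BigramCollocationFinder.from_words(tokens)

    # Look at the most frequent bigrams and their counts first ...
    for bigram, count in finder.ngram_fd.most_common(20):
        print(bigram, count)

    # ... then pick an informed cutoff.
    finder.apply_freq_filter(3)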

NLTK package to estimate the (unigram) perplexity

为君一笑 submitted on 2019-11-29 14:32:12
Question: I am trying to calculate the perplexity for the data I have. The code I am using is:

    import sys
    sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
    from nltk.corpus import brown
    from nltk.model import NgramModel
    from nltk.probability import LidstoneProbDist, WittenBellProbDist
    estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
    lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
    print lm

But I am receiving the error, File "/usr/local
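
The error is unsurprising on a current installation: NgramModel and the whole nltk.model module were removed from NLTK. The modern replacement is the nltk.lm submodule. Below is a rough equivalent of the Lidstone-smoothed trigram model above, sketched against that API; the test sentence is arbitrary:

    from nltk.corpus import brown
    from nltk.lm import Lidstone
    from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
    from nltk.util import ngrams

    order = 3
    train, vocab = padded_everygram_pipeline(order, brown.sents(categories='news'))

    lm = Lidstone(0.2, order)   # gamma = 0.2, matching the original estimator
    lm.fit(train, vocab)

    # Perplexity is computed over a sequence of padded n-grams.
    test_sent = ['the', 'jury', 'said', 'it', 'did']   # arbitrary test text
    test = list(ngrams(pad_both_ends(test_sent, n=order), order))
    print(lm.perplexity(test))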

How to implement a spectrum kernel function in MATLAB?

会有一股神秘感。 submitted on 2019-11-29 11:36:59
A spectrum kernel function operates on strings by counting the n-grams two strings have in common. For example, 'tool' has three 2-grams ('to', 'oo', and 'ol'), and the similarity between 'tool' and 'fool' is 2 ('oo' and 'ol' in common). How can I write a MATLAB function that calculates this metric? The first step would be to create a function that can generate the n-grams for a given string. One way to do this in a vectorized fashion is with some clever indexing:

    function [subStrings, counts] = n_gram(fullString, N)
      if (N == 1)
        [subStrings, ~, index] = unique(cellstr(fullString.')); %.'#
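
For comparison, the same kernel is compact in Python: build the n-gram count vector of each string and take the dot product of the two vectors (for 'tool' and 'fool' with n = 2 this yields 2, matching the example above). A sketch:

    from collections import Counter

    def spectrum_kernel(s, t, n):
        """Dot product of the n-gram count vectors of two strings."""
        cs = Counter(s[i:i + n] for i in range(len(s) - n + 1))
        ct = Counter(t[i:i + n] for i in range(len(t) - n + 1))
        return sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())

    print(spectrum_kernel('tool', 'fool', 2))   # 2: 'oo' and 'ol' in common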
