term-document-matrix | 易学教程

Impossible to see results of `RTextTools::toLower()` text in Document-Term-Matrix

Question: I am trying to create a matrix, and I would like to convert the text to lowercase first. For this I use the following R instruction:

    matrix = create_matrix(tweets[,1], toLower = TRUE, language="english", removeStopwords=FALSE, removeNumbers=TRUE, stemWords=TRUE)

Here is the R code:

    library(RTextTools)
    library(e1071)
    pos_tweets = rbind(
      c('j AIME la voiture', 'positive'),
      c('cette machine est performante', 'positive'),
      c('je me sens en bonne forme ce matin', 'positive'),
      c('je suis super excitée d aller voir le spectacle de

R construct document term matrix how to match dictionaries whose values consist of white-space separated phrases

Question: When doing text mining in R, after preprocessing the text data, we need to create a document-term matrix for further exploration. But, as in Chinese, English also has certain fixed phrases, such as "semantic distance" and "machine learning"; if you segment them into single words, they take on totally different meanings. I want to know how to match predefined dictionaries whose values consist of white-space separated terms, such as "semantic distance" and "machine learning". If a document is "we could
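
The idea behind the question can be illustrated with a small, language-neutral sketch. It is written in Python purely for illustration and is not the R solution the asker is looking for. It greedily matches the longest dictionary phrase at each position, so that "machine learning" is kept as one term before counting. The function name `tokenize_with_phrases` and the sample sentence are hypothetical; the document text in the question is cut off, so it is not reproduced here.

```python
from collections import Counter


def tokenize_with_phrases(text, phrases):
    """Split text on whitespace, but keep dictionary phrases as single tokens."""
    words = text.lower().split()
    # Store each phrase as a tuple of words for exact lookup.
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so that longer phrases win.
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in phrase_set:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No multi-word phrase starts here: emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens


dictionary = ["semantic distance", "machine learning"]  # phrases from the question
sample = "Machine learning can estimate semantic distance"  # hypothetical text
tokens = tokenize_with_phrases(sample, dictionary)
term_counts = Counter(tokens)  # one row of a document-term matrix
print(tokens)
```

Each document's `Counter` can then serve as one row of the document-term matrix, with the multi-word phrases appearing as ordinary terms.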

Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

Following the many guides to creating bigrams using the 'tm' and 'RWeka' packages, I was getting frustrated that only 1-grams were being returned in the tdm. Through much trial and error I discovered that it worked properly with 'VCorpus' but not with 'Corpus'. BTW, I'm pretty sure this was working with 'Corpus' about a month ago, but it is not now. R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest versions. I would appreciate any insight into why this won't work with Corpus and whether others have the same problem. #A

efficient Term Document Matrix with NLTK

I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:

    def fnDTM_Corpus(xCorpus):
        import pandas as pd
        '''to create a Term Document Matrix from a NLTK Corpus'''
        fd_list = []
        for x in range(0, len(xCorpus.fileids())):
            fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
        DTM = pd.DataFrame(fd_list, index = xCorpus.fileids())
        DTM.fillna(0,inplace = True)
        return DTM.T

To run it:

    import nltk
    from nltk.corpus import PlaintextCorpusReader
    corpus_root = 'C:/Data/'
    newcorpus = PlaintextCorpusReader(corpus_root, '.*')
    x = fnDTM_Corpus(newcorpus)

It
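
For comparison, here is a minimal sketch of the same terms-by-documents layout built with only the Python standard library. It is not the asker's method: the documents are assumed to sit in an in-memory dict rather than being read through `PlaintextCorpusReader`, tokenisation is a plain lowercase whitespace split, and the names `build_tdm` and `docs` are hypothetical.

```python
from collections import Counter


def build_tdm(docs):
    """Return (terms, doc_ids, matrix), where matrix[i][j] is the count of
    terms[i] in the document doc_ids[j]."""
    doc_ids = list(docs)
    # One Counter per document: whitespace tokenisation on lowercased text.
    counts = [Counter(docs[d].lower().split()) for d in doc_ids]
    # A sorted vocabulary gives a stable row order.
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for missing keys, so absent terms need no special case.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, doc_ids, matrix


# Hypothetical two-document corpus for illustration.
docs = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the dog sat",
}
terms, doc_ids, matrix = build_tdm(docs)
for term, row in zip(terms, matrix):
    print(term, row)
```

Each document is counted once and zeros are filled in only when the rows are laid out; for a large corpus a sparse representation would likely be preferable to nested lists.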

Error converting text to lowercase with tm_map(…, tolower)

I tried using tm_map. It gave the following error. How can I get around this?

    require(tm)
    byword<-tm_map(byword, tolower)
    Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "character"

Answer 1: Use the base R function tolower():

    tolower(c("THE quick BROWN fox"))
    # [1] "the quick brown fox"

Answer 2: Expanding my comment to a more detailed answer here: you have to wrap tolower inside content_transformer so as not to corrupt the VCorpus object, something like:

    > library(tm)
    > data('crude')
    > crude[[1]]$content
    [1] "Diamond Shamrock Corp said that

More efficient means of creating a corpus and DTM with 4M rows

My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document term matrix such that I can pass it to a Bayesian classifier. Consider the following code:

    library(tm)
    GetCorpus <-function(textVector) {
      doc.corpus <- Corpus(VectorSource(textVector))
      doc.corpus <- tm_map(doc.corpus, tolower)
      doc.corpus <- tm_map(doc.corpus, removeNumbers)
      doc.corpus <- tm_map(doc.corpus, removePunctuation)
      doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
      doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
      doc.corpus <- tm_map(doc.corpus,

findAssocs for multiple terms in R

In R I used the tm package to build a term-document matrix from a corpus of documents. My goal is to extract word associations for all bigrams in the term-document matrix and return the top three or so for each. Therefore I'm looking for a variable that holds all row.names from the matrix so that the function findAssocs() can do its job. This is my code so far:

    library(tm)
    library(RWeka)
    txtData <- read.csv("file.csv", header = T, sep = ",")
    txtCorpus <- Corpus(VectorSource(txtData$text))
    ...further preprocessing
    #Tokenizer for n-grams and passed on to the term-document matrix
