term-document-matrix

Impossible to see results of `RTextTools::toLower()` text in Document-Term-Matrix

白昼怎懂夜的黑 posted on 2019-12-01 13:12:19
Question: I am trying to create a matrix, and I would like the text to be converted to lowercase first. For this I use the following R instruction:

    matrix = create_matrix(tweets[,1], toLower = TRUE, language="english", removeStopwords=FALSE, removeNumbers=TRUE, stemWords=TRUE)

Here is the R code:

    library(RTextTools)
    library(e1071)
    pos_tweets = rbind(
      c('j AIME la voiture', 'positive'),
      c('cette machine est performante', 'positive'),
      c('je me sens en bonne forme ce matin', 'positive'),
      c('je suis super excitée d aller voir le spectacle de

R construct document term matrix how to match dictionaries whose values consist of white-space separated phrases

好久不见. posted on 2019-12-01 12:03:05
Question: When doing text mining in R, after preprocessing the text data we need to create a document-term matrix for further exploration. As in Chinese, English also has certain fixed phrases, such as "semantic distance" and "machine learning"; if you segment them into single words, they take on totally different meanings. I want to know how to match pre-defined dictionaries whose values consist of white-space separated terms, such as "semantic distance" and "machine learning". If a document is "we could
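One possible way to approach this kind of phrase matching, sketched in base R: count each dictionary phrase as a literal substring of each lower-cased document. The sample documents, the `phrases` vector, and the helper `count_phrase` are hypothetical names made up for illustration; they are not taken from the question.

```r
# Hypothetical example documents (not from the question).
docs <- c(
  d1 = "We could use machine learning to estimate semantic distance",
  d2 = "Machine learning needs data"
)

# Pre-defined dictionary whose entries are white-space separated phrases.
phrases <- c("semantic distance", "machine learning")

# Count literal (non-regex) occurrences of one phrase in one text.
count_phrase <- function(text, phrase) {
  hits <- gregexpr(phrase, text, fixed = TRUE)[[1]]
  if (hits[1] == -1L) 0L else length(hits)
}

# Build a small document-by-phrase count matrix.
lower_docs <- tolower(docs)
phrase_dtm <- sapply(phrases, function(p) {
  vapply(lower_docs, count_phrase, integer(1), phrase = p)
})
rownames(phrase_dtm) <- names(docs)
print(phrase_dtm)
```

Because this is plain substring matching, a phrase also matches inside longer words (for example "machine learnings"); adding word-boundary checks would be a natural refinement.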

Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

允我心安 posted on 2019-11-30 07:31:15
Following the many guides to creating bigrams using the 'tm' and 'RWeka' packages, I was getting frustrated that only 1-grams were being returned in the tdm. Through much trial and error I discovered that it works properly with 'VCorpus' but not with 'Corpus'. BTW, I'm pretty sure this was working with 'Corpus' ~1 month ago, but it is not now. R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest. I would appreciate any insight on why this won't work with Corpus, and on whether others have the same problem.

    #A
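A minimal sketch of the VCorpus route described above, assuming the tm package is installed. To avoid RWeka's Java dependency, the tokenizer here is built from `ngrams()` and `words()` (available once tm is attached) rather than from RWeka's `NGramTokenizer`; that is a substitution, not the asker's code, and the sample texts are made up. A likely explanation for the difference, offered as an assumption rather than a confirmed diagnosis, is that in recent tm versions `Corpus()` can return a simpler corpus type for a `VectorSource`, to which custom tokenizers are not applied.

```r
library(tm)

# Made-up sample texts for illustration.
texts <- c("the quick brown fox", "the lazy brown dog")

# Build the corpus explicitly as a VCorpus.
vc <- VCorpus(VectorSource(texts))

# Bigram tokenizer: split a document into words, then join adjacent pairs.
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

# Pass the tokenizer through the control list.
tdm <- TermDocumentMatrix(vc, control = list(tokenize = BigramTokenizer))
print(Terms(tdm))
```

If the terms printed are two-word strings rather than single words, the custom tokenizer was applied.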

efficient Term Document Matrix with NLTK

不想你离开。 posted on 2019-11-29 23:57:56
I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:

    def fnDTM_Corpus(xCorpus):
        import pandas as pd
        '''to create a Term Document Matrix from a NLTK Corpus'''
        fd_list = []
        for x in range(0, len(xCorpus.fileids())):
            fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
        DTM = pd.DataFrame(fd_list, index = xCorpus.fileids())
        DTM.fillna(0,inplace = True)
        return DTM.T

To run it:

    import nltk
    from nltk.corpus import PlaintextCorpusReader
    corpus_root = 'C:/Data/'
    newcorpus = PlaintextCorpusReader(corpus_root, '.*')
    x = fnDTM_Corpus(newcorpus)

It

Error converting text to lowercase with tm_map(…, tolower)

与世无争的帅哥 posted on 2019-11-29 23:11:48
I tried using tm_map. It gave the following error. How can I get around this?

    require(tm)
    byword<-tm_map(byword, tolower)
    Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "character"

Use the base R function tolower():

    tolower(c("THE quick BROWN fox"))
    # [1] "the quick brown fox"

daroczig

Expanding my comment to a more detailed answer here: you have to wrap tolower inside of content_transformer not to screw up the VCorpus object -- something like:

    > library(tm)
    > data('crude')
    > crude[[1]]$content
    [1] "Diamond Shamrock Corp said that
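A short sketch of both suggestions, assuming the tm package is installed; the sample vector `byword` is made up here. The error message indicates that `tm_map` received a plain character vector, so one can either lower-case the vector directly or build a corpus first and wrap `tolower` in `content_transformer`.

```r
library(tm)

# Made-up sample data standing in for the asker's `byword`.
byword <- c("THE quick BROWN fox", "Another LINE")

# Option 1: a plain character vector only needs base R.
lowered <- tolower(byword)

# Option 2: build a corpus first, then wrap tolower so that tm_map
# keeps each element a proper text document.
corp <- VCorpus(VectorSource(byword))
corp <- tm_map(corp, content_transformer(tolower))

first_doc <- content(corp[[1]])
print(first_doc)
```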

Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

孤人 posted on 2019-11-29 09:40:12
Question: Following the many guides to creating bigrams using the 'tm' and 'RWeka' packages, I was getting frustrated that only 1-grams were being returned in the tdm. Through much trial and error I discovered that it works properly with 'VCorpus' but not with 'Corpus'. BTW, I'm pretty sure this was working with 'Corpus' ~1 month ago, but it is not now. R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest. I would

Error converting text to lowercase with tm_map(…, tolower)

萝らか妹 posted on 2019-11-28 20:36:59
Question: I tried using tm_map. It gave the following error. How can I get around this?

    require(tm)
    byword<-tm_map(byword, tolower)
    Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "character"

Answer 1: Use the base R function tolower():

    tolower(c("THE quick BROWN fox"))
    # [1] "the quick brown fox"

Answer 2: Expanding my comment to a more detailed answer here: you have to wrap tolower inside of content_transformer not to screw up the VCorpus object --

More efficient means of creating a corpus and DTM with 4M rows

大憨熊 posted on 2019-11-28 16:35:24
My file has over 4M rows, and I need a more efficient way of converting my data to a corpus and document term matrix so that I can pass it to a Bayesian classifier. Consider the following code:

    library(tm)
    GetCorpus <-function(textVector) {
      doc.corpus <- Corpus(VectorSource(textVector))
      doc.corpus <- tm_map(doc.corpus, tolower)
      doc.corpus <- tm_map(doc.corpus, removeNumbers)
      doc.corpus <- tm_map(doc.corpus, removePunctuation)
      doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
      doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
      doc.corpus <- tm_map(doc.corpus,

findAssocs for multiple terms in R

半世苍凉 posted on 2019-11-28 10:34:58
In R I used the tm package to build a term-document matrix from a corpus of documents. My goal is to extract word associations for all bigrams in the term-document matrix and return the top three or so for each. Therefore I'm looking for a variable that holds all row names of the matrix, so that the function findAssocs() can do its job. This is my code so far:

    library(tm)
    library(RWeka)
    txtData <- read.csv("file.csv", header = T, sep = ",")
    txtCorpus <- Corpus(VectorSource(txtData$text))
    ...further preprocessing
    #Tokenizer for n-grams and passed on to the term-document matrix
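A hedged sketch of one way to do this, assuming the tm package is installed. It uses tm's bundled `crude` example corpus and single-word terms rather than the asker's CSV file and bigrams, and the query terms and the 0.7 correlation limit are illustrative choices. `Terms()` returns the terms of the matrix (its row labels for a term-document matrix), and `findAssocs()` accepts a character vector of terms.

```r
library(tm)

# Bundled example corpus, used here instead of the asker's data.
data("crude")
tdm <- TermDocumentMatrix(crude)

# All terms of the matrix, i.e. the row labels of a TermDocumentMatrix.
all_terms <- Terms(tdm)

# Query several terms at once; this returns a list with one entry per term.
query_terms <- c("oil", "opec")
assocs <- findAssocs(tdm, query_terms, corlimit = 0.7)

# Keep at most the three strongest associations for each queried term.
top_three <- lapply(assocs, function(v) head(sort(v, decreasing = TRUE), 3))
print(top_three)
```

The same pattern should carry over to a bigram matrix, since the terms are then simply two-word strings; passing a large term vector may be slow on a big matrix.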

More efficient means of creating a corpus and DTM with 4M rows

落花浮王杯 posted on 2019-11-27 19:56:41
Question: My file has over 4M rows, and I need a more efficient way of converting my data to a corpus and document term matrix so that I can pass it to a Bayesian classifier. Consider the following code:

    library(tm)
    GetCorpus <-function(textVector) {
      doc.corpus <- Corpus(VectorSource(textVector))
      doc.corpus <- tm_map(doc.corpus, tolower)
      doc.corpus <- tm_map(doc.corpus, removeNumbers)
      doc.corpus <- tm_map(doc.corpus, removePunctuation)
      doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english")