tm | 易学教程

In R tm package, build corpus FROM Document-Term-Matrix

Question: It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to build a corpus from a document-term matrix. Let M be the number of documents in a document set and V the number of terms in its vocabulary. Then a document-term matrix is an M × V matrix. I also have a vocabulary vector of length V, which holds the words that the column indices of the document-term matrix represent. From the dtm and vocabulary vector, I'd like to
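The idea in this question can be sketched in base R, under some assumptions: the matrix is a dense numeric count matrix (a tm DocumentTermMatrix would first need as.matrix()), and the names `dtm`, `vocab`, and `dtm_to_texts` are hypothetical, chosen only for illustration. A document-term matrix does not store word order, so the rebuilt texts are bags of words, not the original sentences.

```r
# Hypothetical toy inputs: 2 documents (M = 2), 3 vocabulary terms (V = 3).
vocab <- c("apple", "banana", "cherry")
dtm <- matrix(c(2, 0, 1,
                0, 1, 1),
              nrow = 2, byrow = TRUE)

# Rebuild one bag-of-words string per document by repeating each
# vocabulary term as many times as its count in that row.
dtm_to_texts <- function(dtm, vocab) {
  stopifnot(ncol(dtm) == length(vocab))
  vapply(seq_len(nrow(dtm)), function(i) {
    paste(rep(vocab, times = dtm[i, ]), collapse = " ")
  }, character(1))
}

texts <- dtm_to_texts(dtm, vocab)
print(texts)

# With tm installed, the strings could then be wrapped in a corpus, e.g.
#   corp <- tm::VCorpus(tm::VectorSource(texts))
```

The resulting character vector has one element per row of the matrix, so it can be passed to any function that accepts a vector of documents.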

Inconsistent behaviour with tm_map transformation functions when using multiple cores

Another potential title for this post could be "When parallel processing in R, does the ratio between number of cores, loop chunk size and object size matter?" I have a corpus that I am running some transformations on using the tm package. Since the corpus is large, I'm using parallel processing with the doParallel package. Sometimes the transformations do the task, but sometimes they do not. Take tm::removeNumbers(), for example. The very first document in the corpus has a content value of "n417", so if preprocessing is successful this document will be transformed to just "n". Sample corpus is below for

Replace words in corpus according to dictionary data frame

I am interested in replacing all words in a tm Corpus object according to a dictionary stored as a two-column data frame, where the first column is the word to be matched and the second column is the replacement word. I am stuck on the translate function. I saw this answer, but I can't turn it into a function that can be passed to tm_map. Please consider the following MWE:

    library(tm)
    docs <- c("first text", "second text")
    corp <- Corpus(VectorSource(docs))
    dictionary <- data.frame(word = c('first', 'second', 'text'),
                             translation = c('primo', 'secondo', 'testo'))
    translate <- function(text,
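One possible shape for such a replacement function, sketched in base R so that it runs without tm. This is not the asker's truncated `translate` function or the linked answer; the name `translate_text` and its `dict` argument are hypothetical. It assumes the dictionary words contain no regex metacharacters and that no translation is itself a dictionary word (replacements are applied one after another).

```r
# Dictionary from the question's MWE; stringsAsFactors = FALSE keeps the
# columns as character vectors on older R versions too.
dictionary <- data.frame(word = c("first", "second", "text"),
                         translation = c("primo", "secondo", "testo"),
                         stringsAsFactors = FALSE)

# Replace each dictionary word, matched as a whole word, by its translation.
translate_text <- function(text, dict) {
  for (i in seq_len(nrow(dict))) {
    pattern <- paste0("\\b", dict$word[i], "\\b")
    text <- gsub(pattern, dict$translation[i], text, perl = TRUE)
  }
  text
}

translated <- translate_text(c("first text", "second text"), dictionary)
print(translated)

# With tm installed, the function could be wrapped for tm_map, e.g.
#   corp <- tm_map(corp, content_transformer(translate_text), dict = dictionary)
```

Because the function takes and returns a character vector, it works on plain strings as well as inside content_transformer().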

transformation drops documents error in R

Whenever I run this code, the tm_map line gives me this warning:

    Warning message:
    In tm_map.SimpleCorpus(docs, toSpace, "/") : transformation drops documents

The code:

    texts <- read.csv("./Data/fast food/Domino's/Domino's veg pizza.csv", stringsAsFactors = FALSE)
    docs <- Corpus(VectorSource(texts))
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    docs <- tm_map(docs, toSpace, "/")
    docs <- tm_map(docs, toSpace, "@")
    docs <- tm_map(docs, toSpace, "\\|")
    docs <- tm_map(docs, content_transformer(tolower))
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, removeWords,

R, tm-error of transformation drops documents

I want to create a network based on the weight of keywords from text. I got a warning when running the code related to tm_map:

    library(tm)
    library(NLP)
    library(openNLP)
    text = c('.......')
    corp <- Corpus(VectorSource(text))
    corp <- tm_map(corp, stripWhitespace)
    Warning message:
    In tm_map.SimpleCorpus(corp, stripWhitespace) : transformation drops documents
    corp <- tm_map(corp, tolower)
    Warning message:
    In tm_map.SimpleCorpus(corp, tolower) : transformation drops documents

The code was working 2 months ago; now I'm trying it on new data and it is not working anymore. Anyone please shows

How to write custom removePunctuation() function to better deal with Unicode chars?

In the source code of the tm text-mining R package, in file transform.R, there is the removePunctuation() function, currently defined as:

    function(x, preserve_intra_word_dashes = FALSE)
    {
        if (!preserve_intra_word_dashes)
            gsub("[[:punct:]]+", "", x)
        else {
            # Assume there are no ASCII 1 characters.
            x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
            x <- gsub("[[:punct:]]+", "", x)
            gsub("\1", "-", x, fixed = TRUE)
        }
    }

I need to parse and mine some abstracts from a science conference (fetched from their website as UTF-8). The abstracts contain some Unicode characters that need to be removed, particularly at
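A hedged sketch of one possible Unicode-aware variant, written in base R so it runs without tm. It is not part of tm; the name `remove_punctuation_unicode` is hypothetical. It assumes the input is UTF-8 and that R's PCRE was built with Unicode property support, and it swaps `[[:punct:]]` for the Unicode property class `\p{P}`. Note that `\p{P}` does not cover symbols such as `$` or `+`, which `[[:punct:]]` does match in ASCII; adding `\p{S}` to the class would be needed to remove those too.

```r
# Same structure as the function quoted above, but using the Unicode
# punctuation property \p{P} with perl = TRUE.
remove_punctuation_unicode <- function(x, preserve_intra_word_dashes = FALSE) {
  if (!preserve_intra_word_dashes) {
    return(gsub("\\p{P}+", "", x, perl = TRUE))
  }
  # Protect intra-word dashes with the ASCII 1 control character,
  # strip punctuation, then restore the dashes.
  x <- gsub("(\\w)-(\\w)", "\\1\001\\2", x, perl = TRUE)
  x <- gsub("\\p{P}+", "", x, perl = TRUE)
  gsub("\001", "-", x, fixed = TRUE)
}

# Curly quotes (U+201C, U+201D) and an ellipsis (U+2026) are removed.
cleaned <- remove_punctuation_unicode("\u201cquoted\u201d text\u2026")
print(cleaned)

# An en dash (U+2013) is removed while the intra-word hyphen is kept.
kept <- remove_punctuation_unicode("well-known \u2013 fact!", TRUE)
print(kept)
```

Since the function has the same signature as the quoted one, it could be passed to tm_map through content_transformer() in the same way as any other character-vector function.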

Adding custom stopwords in R tm

I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords:

    tm_map(abs, removeWords, stopwords("english"))

Is there a way to add my own custom stop words to this list?

stopwords just provides you with a vector of words; just combine your own with it:

    tm_map(abs, removeWords, c(stopwords("english"), "my", "custom", "words"))

Reza Rahimi

Save your custom stop words in a CSV file (e.g. word.csv):

    library(tm)
    stopwords <- read.csv("word.csv", header = FALSE)
    stopwords <- as.character(stopwords$V1)
    stopwords <- c(stopwords, stopwords())

Then you can

How to determine which older version of the R package is compatible with my R version

Question: I am trying to install the "tm" package, but I get an error saying that "tm" is not available for my R version:

    package ‘tm’ is not available (for R version 3.0.2)

I then saw that someone suggested downloading the archived version from http://cran.r-project.org/src/contrib/Archive/tm/?C=M;O=A and installing it from source. My question is: how do I determine which file in that list is compatible with my R version?

Answer 1: You can use the METACRAN mirror: go to the blame page of the DESCRIPTION file of the package you're interested in. E.g. for tm: https://github.com/cran/tm/blame