tm

In R tm package, build corpus FROM Document-Term-Matrix

て烟熏妆下的殇ゞ submitted on 2019-12-01 05:27:16
Question: It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to build a corpus from a document-term matrix. Let M be the number of documents in a document set. Let V be the number of terms in the vocabulary of that document set. Then a document-term matrix is an M-by-V matrix. I also have a vocabulary vector of length V. The vocabulary vector holds the words represented by the column indices of the document-term matrix. From the dtm and vocabulary vector, I'd like to
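The question text is cut off here, so the following is only a minimal sketch of one way to go from counts back to a corpus. It assumes the document-term matrix is (or has been converted with as.matrix() to) a plain numeric matrix with one row per document and one column per vocabulary entry. The names `vocab`, `dtm`, and the sample values are made up for illustration. Word order cannot be recovered from counts, so each rebuilt document is just its terms repeated.

```r
library(tm)

# Hypothetical inputs: a vocabulary of length V and an M x V count matrix.
vocab <- c("apple", "banana", "cherry")
dtm <- matrix(c(2L, 0L, 1L,
                0L, 3L, 0L),
              nrow = 2, byrow = TRUE)

# For each row, repeat every vocabulary word by its count and join with spaces.
docs <- apply(dtm, 1, function(counts) {
  paste(rep(vocab, times = counts), collapse = " ")
})

# Wrap the reconstructed texts in a corpus.
corp <- VCorpus(VectorSource(docs))
```

Passing `corp` back through DocumentTermMatrix() should give comparable counts, subject to that function's default term filtering.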

Inconsistent behaviour with tm_map transformation functions when using multiple cores

折月煮酒 submitted on 2019-12-01 02:11:06
Another potential title for this post could be "When parallel processing in R, does the ratio between number of cores, loop chunk size and object size matter?" I have a corpus that I am running some transformations on using the tm package. Since the corpus is large, I'm using parallel processing with the doParallel package. Sometimes the transformations do the task, but sometimes they do not. For example, tm::removeNumbers(). The very first document in the corpus has a content value of "n417". So if preprocessing is successful, this doc will be transformed to just "n". Sample corpus is below for
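As a sequential baseline to compare parallel runs against, the expected effect of removeNumbers() can be checked directly. This is a small sketch with a made-up two-document corpus (only "n417" comes from the question). It does not reproduce or diagnose the parallel behaviour described above.

```r
library(tm)

# "n417" is the example from the question; the second string is invented.
corp <- VCorpus(VectorSource(c("n417", "abc 123 def")))
corp <- tm_map(corp, removeNumbers)

# Pull the text back out and flag any document that still contains digits.
texts <- vapply(content(corp),
                function(d) paste(content(d), collapse = " "),
                character(1))
still_has_digits <- grepl("[[:digit:]]", texts)
```

Running the same digit check on the output of each parallel chunk would show which chunks were left untransformed.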

Replace words in corpus according to dictionary data frame

时光毁灭记忆、已成空白 submitted on 2019-12-01 01:10:10
I am interested in replacing all words in a tm Corpus object according to a dictionary stored as a two-column data frame, where the first column is the word to be matched and the second column is the replacement word. I am stuck with the translate function. I saw this answer, but I can't turn it into a function that can be passed to tm_map. Please consider the following MWE: library(tm) docs <- c("first text", "second text") corp <- Corpus(VectorSource(docs)) dictionary <- data.frame(word = c('first', 'second', 'text'), translation = c('primo', 'secondo', 'testo')) translate <- function(text,
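The asker's translate function is truncated, so the version below is a hypothetical completion rather than the original code. It is a minimal sketch that loops over the dictionary rows, replaces whole words with gsub(), and wraps the function in content_transformer() so that tm_map can apply it. The `dict` argument name and the use of VCorpus are assumptions.

```r
library(tm)

docs <- c("first text", "second text")
corp <- VCorpus(VectorSource(docs))

# stringsAsFactors = FALSE keeps the columns as character in older R versions.
dictionary <- data.frame(word = c("first", "second", "text"),
                         translation = c("primo", "secondo", "testo"),
                         stringsAsFactors = FALSE)

# Hypothetical helper: replace each dictionary word, matched as a whole word.
translate <- function(text, dict) {
  for (i in seq_len(nrow(dict))) {
    pattern <- paste0("\\b", dict$word[i], "\\b")
    text <- gsub(pattern, dict$translation[i], text, perl = TRUE)
  }
  text
}

# Extra arguments to tm_map are forwarded to the wrapped function.
corp_translated <- tm_map(corp, content_transformer(translate), dict = dictionary)
translated <- vapply(content(corp_translated), content, character(1))
```

Because replacements run one after another, a translation that is itself a dictionary word would be replaced again. The words are also inserted into a regular expression unescaped, which is fine for plain words like these.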

transformation drops documents error in R

老子叫甜甜 submitted on 2019-12-01 00:44:36
Whenever I run this code, the tm_map line gives me this warning message: Warning message: In tm_map.SimpleCorpus(docs, toSpace, "/") : transformation drops documents texts <- read.csv("./Data/fast food/Domino's/Domino's veg pizza.csv",stringsAsFactors = FALSE) docs <- Corpus(VectorSource(texts)) toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x)) docs <- tm_map(docs, toSpace, "/") docs <- tm_map(docs, toSpace, "@") docs <- tm_map(docs, toSpace, "\\|") docs <- tm_map(docs, content_transformer(tolower)) docs <- tm_map(docs, removeNumbers) docs <- tm_map(docs, removeWords,
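The warning text names tm_map.SimpleCorpus, so one commonly suggested workaround is to build a VCorpus instead of letting Corpus() pick a SimpleCorpus. The sketch below assumes that approach and uses two invented strings in place of the asker's CSV file. With a VCorpus, base functions such as tolower need the content_transformer() wrapper.

```r
library(tm)

# Invented sample texts standing in for the CSV content.
texts <- c("Pizza/Pasta", "Order 2 @ home|office")

# VCorpus is dispatched to tm_map.VCorpus rather than tm_map.SimpleCorpus.
docs <- VCorpus(VectorSource(texts))

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)

first_doc <- content(docs[[1]])
```

When the input comes from read.csv(), passing a single text column to VectorSource() rather than the whole data frame keeps one document per row. The column name depends on the file, so it is not shown here.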

R, tm-error of transformation drops documents

烈酒焚心 submitted on 2019-11-30 20:20:50
I want to create a network based on the weight of keywords from text. I got a warning when running the code related to tm_map: library(tm) library(NLP) library(openNLP) text = c('.......') corp <- Corpus(VectorSource(text)) corp <- tm_map(corp, stripWhitespace) Warning message: In tm_map.SimpleCorpus(corp, stripWhitespace) : transformation drops documents corp <- tm_map(corp, tolower) Warning message: In tm_map.SimpleCorpus(corp, tolower) : transformation drops documents The code was working two months ago; now I'm trying it on new data and it no longer works. Anyone please shows

transformation drops documents error in R

安稳与你 submitted on 2019-11-30 17:32:34
Question: Whenever I run this code, the tm_map line gives me this warning message: Warning message: In tm_map.SimpleCorpus(docs, toSpace, "/") : transformation drops documents texts <- read.csv("./Data/fast food/Domino's/Domino's veg pizza.csv",stringsAsFactors = FALSE) docs <- Corpus(VectorSource(texts)) toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x)) docs <- tm_map(docs, toSpace, "/") docs <- tm_map(docs, toSpace, "@") docs <- tm_map(docs, toSpace, "\\|") docs <- tm_map(docs,

How to determine which older version of the R package is compatible with my R version

廉价感情. submitted on 2019-11-30 14:38:49
Question: I am trying to install the "tm" package, but I get an error saying that "tm" is not available for my R version: package ‘tm’ is not available (for R version 3.0.2) I then saw that someone suggested downloading the archived version from http://cran.r-project.org/src/contrib/Archive/tm/?C=M;O=A and installing it from source. My question is: how do I determine which file in that list is compatible with my R version? Answer 1: You can use the METACRAN mirror: Go to the blame page of

How to write custom removePunctuation() function to better deal with Unicode chars?

非 Y 不嫁゛ submitted on 2019-11-30 14:21:46
In the source code of the tm text-mining R package, in the file transform.R, there is the removePunctuation() function, currently defined as: function(x, preserve_intra_word_dashes = FALSE) { if (!preserve_intra_word_dashes) gsub("[[:punct:]]+", "", x) else { # Assume there are no ASCII 1 characters. x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x) x <- gsub("[[:punct:]]+", "", x) gsub("\1", "-", x, fixed = TRUE) } } I need to parse and mine some abstracts from a science conference (fetched from their website as UTF-8). The abstracts contain some Unicode characters that need to be removed, particularly at
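One possible direction, sketched here under assumptions rather than taken from the tm source, is to match Unicode punctuation and symbol categories with a Perl-compatible regex. The function name `remove_punct_unicode` and the sample string are hypothetical. `\p{P}` covers punctuation such as dashes, curly quotes, and ellipses, and `\p{S}` covers symbols.

```r
library(tm)

# Hypothetical replacement for removePunctuation() that uses Unicode
# character properties instead of the POSIX [[:punct:]] class.
remove_punct_unicode <- function(x) {
  gsub("[\\p{P}\\p{S}]+", "", x, perl = TRUE)
}

# Invented sample with an en dash, curly quotes, and an ellipsis.
x <- "caf\u00e9 \u2013 \u201cquoted\u201d text\u2026"
out <- remove_punct_unicode(x)

# content_transformer() makes the function usable with tm_map().
corp <- VCorpus(VectorSource(x))
corp <- tm_map(corp, content_transformer(remove_punct_unicode))
```

Accented letters such as "é" are left alone because they are letters, not punctuation. Removing the en dash leaves two adjacent spaces, which stripWhitespace() can collapse afterwards.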

Adding custom stopwords in R tm

假如想象 submitted on 2019-11-30 11:46:50
I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords: tm_map(abs, removeWords, stopwords("english")) Is there a way to add my own custom stop words to this list? stopwords just provides you with a vector of words, so just combine (c) your own words with it: tm_map(abs, removeWords, c(stopwords("english"),"my","custom","words")) Reza Rahimi Save your custom stop words in a csv file (e.g. word.csv). library(tm) stopwords <- read.csv("word.csv", header = FALSE) stopwords <- as.character(stopwords$V1) stopwords <- c(stopwords, stopwords()) Then you can
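The first answer's idea can be written out as a short runnable sketch. The sample sentence and the variable names `corp` and `my_stopwords` are made up. The sketch avoids `abs` and `stopwords` as variable names so that it does not shadow the functions of the same name.

```r
library(tm)

# Invented sample document.
corp <- VCorpus(VectorSource("this is my custom text about words and pizza"))

# Extend the built-in English list with extra words.
my_stopwords <- c(stopwords("english"), "my", "custom", "words")

corp <- tm_map(corp, removeWords, my_stopwords)
corp <- tm_map(corp, stripWhitespace)

cleaned <- content(corp[[1]])
```

A list read from a file, as in the second answer, can be combined in the same way once it is a character vector.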

How to determine which older version of the R package is compatible with my R version

六月ゝ 毕业季﹏ submitted on 2019-11-30 11:12:00
I am trying to install the "tm" package, but I get an error saying that "tm" is not available for my R version: package ‘tm’ is not available (for R version 3.0.2) I then saw that someone suggested downloading the archived version from http://cran.r-project.org/src/contrib/Archive/tm/?C=M;O=A and installing it from source. My question is: how do I determine which file in that list is compatible with my R version? You can use the METACRAN mirror: Go to the blame page of the DESCRIPTION file of the package you're interested in. E.g. for tm: https://github.com/cran/tm/blame
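The answer is cut off, but the check it points toward is comparing the R requirement in a package version's DESCRIPTION file with the installed R version. The helper below is a hypothetical sketch of that comparison in base R. The `Depends` string in the example is invented for illustration and is not the real value for any tm release.

```r
# Hypothetical helper: TRUE if the "R (>= x.y.z)" requirement in a
# DESCRIPTION Depends field is satisfied by the given R version.
r_requirement_met <- function(depends_field, r_version = getRversion()) {
  pattern <- "(^|,)[[:space:]]*R[[:space:]]*\\(>=[[:space:]]*([0-9.]+)\\)"
  m <- regmatches(depends_field, regexec(pattern, depends_field))[[1]]
  if (length(m) < 3) {
    return(TRUE)  # no explicit R requirement found
  }
  r_version >= package_version(m[3])
}

# Invented Depends field, checked against the R version from the question.
example_depends <- "R (>= 3.2.0), NLP (>= 0.1-6.2)"
ok_for_302 <- r_requirement_met(example_depends, package_version("3.0.2"))
ok_for_350 <- r_requirement_met(example_depends, package_version("3.5.0"))
```

This only checks the R version. Dependencies listed in the same field may have their own minimum versions that also need checking.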