tm

tm package error “Cannot convert DocumentTermMatrix into normal matrix since vector is too large”

北城以北 submitted on 2019-12-05 11:16:21
I have created a DocumentTermMatrix with 1859 documents (rows) and 25722 terms (columns). To perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command, but it fails with the error "cannot allocate vector of size 364.8 MB":

    > corp
    A corpus with 1859 text documents
    > mat <- DocumentTermMatrix(corp)
    > dim(mat)
    [1]  1859 25722
    > is(mat)
    [1] "DocumentTermMatrix"
    > mat2 <- as.matrix(mat)
    Fehler: kann Vektor der Größe 364.8 MB nicht allozieren  # cannot allocate vector of size 364.8 MB
    > object.size(mat)
    5502000 bytes
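One common workaround (not shown in the question itself) is to avoid densifying the matrix at all: a DocumentTermMatrix is stored as a slam simple triplet matrix, and its i/j/v slots can be fed straight into Matrix::sparseMatrix. A minimal sketch, using a toy corpus standing in for the asker's 1859-document one:

```r
library(tm)
library(Matrix)

# toy corpus standing in for the asker's 1859-document corpus
corp <- VCorpus(VectorSource(c("one two two", "two three")))
dtm  <- DocumentTermMatrix(corp)

# a DocumentTermMatrix is a simple_triplet_matrix: convert its
# triplet slots (i, j, v) directly into a sparse Matrix object,
# skipping the dense 1859 x 25722 allocation that as.matrix() needs
sparse <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                       dims = c(dtm$nrow, dtm$ncol),
                       dimnames = dimnames(dtm))

# most linear-algebra operations work on the sparse form directly
rowSums(sparse)
```

Whether this helps depends on what "further calculations" are needed; anything that genuinely requires a dense matrix will still hit the same memory wall.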

Support Vector Machine works on Training-set but not on Test-set in R (using e1071)

ぃ、小莉子 submitted on 2019-12-05 07:00:13
Question: I'm using a support vector machine for my document classification task. It classifies all the articles in my training set, but fails to classify the ones in my test set. trainDTM is the document-term matrix of my training set; testDTM is the one for the test set. Here is my (not so beautiful) code:

    # create data.frame with labelled sentences
    labeled <- as.data.frame(read.xlsx("C:\\Users\\LABELED.xlsx", 1, header=T))

    # create training set and test set
    traindata <- as.data.frame(labeled[1:700,c(
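A frequent cause of this symptom (a guess, since the question's code is truncated) is that the test DTM is built with its own vocabulary, so its columns do not line up with the features the SVM was trained on. tm can build the test matrix over the training vocabulary via the dictionary control option; a sketch with toy corpora standing in for the asker's data:

```r
library(tm)

# toy corpora standing in for the asker's training and test sets
trainCorpus <- VCorpus(VectorSource(c("good movie", "bad plot")))
testCorpus  <- VCorpus(VectorSource(c("good plot twist")))

trainDTM <- DocumentTermMatrix(trainCorpus)

# build the test matrix over the *training* vocabulary so its
# columns match the features the SVM saw during training
testDTM <- DocumentTermMatrix(testCorpus,
                              control = list(dictionary = Terms(trainDTM)))

# the two matrices now share the same term columns
identical(Terms(trainDTM), Terms(testDTM))
```

Terms present only in the test set are dropped, and training-only terms appear as zero columns, which is exactly what a fitted classifier expects.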

Snowball Stemmer only stems last word

最后都变了- submitted on 2019-12-05 06:05:08
I want to stem the documents in a Corpus of plain-text documents using the tm package in R. When I apply the SnowballStemmer function to all documents of the corpus, only the last word of each document is stemmed.

    library(tm)
    library(Snowball)
    library(RWeka)
    library(rJava)
    path <- c("C:/path/to/diretory")
    corp <- Corpus(DirSource(path),
                   readerControl = list(reader = readPlain, language = "en_US", load = TRUE))
    tm_map(corp, SnowballStemmer)  # stemDocument has the same problem

I think it is related to the way the documents are read into the corpus. To illustrate this with some simple examples:
    >
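This behaviour typically happens when the stemmer receives the whole document as one string and stems only its final token. One workaround (a sketch, not the thread's accepted answer) is to split each document into words, stem them individually with SnowballC::wordStem, and paste them back together:

```r
library(tm)
library(SnowballC)  # current replacement for the old Snowball package

# stem every word of a document, not just the last one
stemEachWord <- function(doc) {
  words <- unlist(strsplit(as.character(doc), "\\s+"))
  paste(wordStem(words, language = "en"), collapse = " ")
}

corp <- VCorpus(VectorSource(c("running runners ran")))
corp <- tm_map(corp, content_transformer(stemEachWord))

as.character(corp[[1]])
```

In current versions of tm, `tm_map(corp, stemDocument)` already stems word by word, so the manual split is only needed with the old Snowball-based setup shown in the question.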

Remove meaningless words from corpus in R

你说的曾经没有我的故事 submitted on 2019-12-05 02:38:53
Question: I am using tm and wordcloud to perform some basic text mining in R. The text being processed contains many meaningless words like asfdg and aawptkr, and I need to filter such words out. The closest solution I have found is to use library(qdapDictionaries) and build a custom function to check the validity of words:

    library(qdapDictionaries)
    is.word <- function(x) x %in% GradyAugmented
    # example
    > is.word("aapg")
    [1] FALSE

The rest of the text-mining code is:

    curDir <- "E:/folder1/"  # folder1
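The is.word() check above can be lifted into a corpus transformation: tokenize each document, keep only the tokens found in GradyAugmented, and rejoin. A sketch, assuming the corpus text is already lower-cased (GradyAugmented is a lower-case word list):

```r
library(tm)
library(qdapDictionaries)

# drop every token that is not in the GradyAugmented word list
keepRealWords <- function(doc) {
  words <- unlist(strsplit(as.character(doc), "\\s+"))
  paste(words[words %in% GradyAugmented], collapse = " ")
}

corpus <- VCorpus(VectorSource(c("the asfdg cat aawptkr sat")))
corpus <- tm_map(corpus, content_transformer(keepRealWords))

as.character(corpus[[1]])  # "the cat sat"
```

Note that any legitimate term absent from the dictionary (names, domain jargon) will also be removed, so this filter is best applied after inspecting what it discards.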

tm loses the metadata when applying tm_map

回眸只為那壹抹淺笑 submitted on 2019-12-05 02:24:35
Question: I have a (small) problem with the tm R library. Say I have a corpus:

    # boilerplate
    bcorp <- c("one","two","three","four","five")
    myCorpus <- Corpus(VectorSource(bcorp), list(lanuage = "en_US"))
    tdm <- TermDocumentMatrix(myCorpus)
    Docs(tdm)

Result:

    [1] "1" "2" "3" "4" "5"

This works. But when I try to use a transformation with tm_map():

    # this does not work
    myCorpus <- Corpus(VectorSource(bcorp), list(lanuage = "en_US"))
    myCorpus <- tm_map(myCorpus, tolower)
    tdm <- TermDocumentMatrix(myCorpus)
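The usual cause (hedged, since the question is cut off) is that passing a base function like tolower straight to tm_map replaces each document with a bare character vector, dropping the metadata tm uses for document IDs. Wrapping the function in content_transformer() applies it to the content only and keeps the corpus structure intact:

```r
library(tm)

bcorp <- c("one", "two", "three", "four", "five")
myCorpus <- Corpus(VectorSource(bcorp))

# content_transformer() applies tolower to the text content only,
# leaving each document's metadata (including its ID) intact
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)  # "1" "2" "3" "4" "5", as before the transformation
```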

Removing stopwords from a user-defined corpus in R

你说的曾经没有我的故事 submitted on 2019-12-05 02:03:51
Question: I have a set of documents:

    documents = c("She had toast for breakfast",
                  "The coffee this morning was excellent",
                  "For lunch let's all have pancakes",
                  "Later in the day, there will be more talks",
                  "The talks on the first day were great",
                  "The second day should have good presentations too")

In this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using:

    documents = tolower(documents)  # make it lower case
    documents = gsub('[[
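For a plain character vector like this one, tm's removeWords() can strip the standard English stopword list directly, with no corpus object required. A sketch with the first two documents:

```r
library(tm)

documents <- c("she had toast for breakfast",
               "the coffee this morning was excellent")

# removeWords also works on plain character vectors,
# not only on corpus objects
documents <- removeWords(documents, stopwords("en"))

# collapse the whitespace runs left behind by removed words
documents <- gsub("\\s+", " ", trimws(documents))
documents
```

stopwords("en") is a fixed list; domain-specific stopwords can be appended to it with c() before the call.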

FUN-error after running 'tolower' while making Twitter wordcloud

Deadly submitted on 2019-12-05 00:53:07
Question: I am trying to create a word cloud from Twitter data, but I get the following error:

    Error in FUN(X[[72L]], ...) : invalid input '������������❤������������ "@xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs'

This error appears after running the line mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower) in this code:

    mytwittersearch_list <- sapply(mytwittersearch, function(x) x$getText())
    mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_corpus_list))
    mytwittersearch_corpus <- tm_map
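A common fix (hedged, since the rest of the thread is not shown) is to strip the byte sequences that are not valid text with iconv() before lower-casing, and to wrap tolower in content_transformer so the corpus structure survives:

```r
library(tm)

# drop bytes that do not survive conversion to plain ASCII
# (emoji fragments and other non-UTF-8 debris from tweets)
cleanEncoding <- function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")

tweets <- c("\u2764 @xxx: bla, bla, bla... http://t.co/56Fb78aTSC")
corpus <- Corpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(cleanEncoding))
corpus <- tm_map(corpus, content_transformer(tolower))

as.character(corpus[[1]])
```

Using sub = "" silently deletes the offending characters; sub = "byte" keeps them as hex escapes if they need to be inspected instead.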

tm custom removePunctuation except hashtag

北城以北 submitted on 2019-12-05 00:47:50
Question: I have a corpus of tweets from Twitter. I clean this corpus (removeWords, tolower, delete URLs) and finally also want to remove punctuation. Here is my code:

    tweetCorpus <- tm_map(tweetCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)

The problem is that by doing so I also lose the hashtag symbol (#). Is there a way to remove punctuation with tm_map but keep the hashtags?

Answer 1: You could adapt the existing removePunctuation to suit your needs. For example:

    removeMostPunctuation <-
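The answer's code is cut off above; one simple version of the same idea (a sketch, not necessarily the answer's exact implementation) is a regex that deletes everything except letters, digits, whitespace, and the characters to keep:

```r
library(tm)

# remove punctuation but keep '#' for hashtags
# (and '-' to preserve intra-word dashes, as in the original call)
removeMostPunctuation <- content_transformer(function(x) {
  gsub("[^[:alnum:][:space:]#-]", "", x)
})

tweetCorpus <- Corpus(VectorSource(c("wow!! #rstats is well-liked, right?")))
tweetCorpus <- tm_map(tweetCorpus, removeMostPunctuation)

as.character(tweetCorpus[[1]])  # "wow #rstats is well-liked right"
```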

Importing pdf in R through package “tm”

折月煮酒 submitted on 2019-12-04 20:41:05
I know the practical example for loading a PDF into the R workspace through the package "tm", but I am not able to understand how the code works, and therefore cannot import the PDF I actually want. The PDF imported in the following code is the "tm" vignette:

    if(file.exists(Sys.which("pdftotext"))) {
      pdf <- readPDF(PdftotextOptions = "-layout")(elem = list(uri = vignette("tm")$pdf),
                     language = "en", id = "id1")
      pdf[1:13]
    }

The PDF I am trying to load is a different one. How do I change the above code to bring my own PDF into the workspace? minn is the PDF document which I am trying to
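In current versions of tm, the pdftotext options are passed via a control list rather than the old PdftotextOptions argument, and the reader returned by readPDF() is handed to Corpus() directly. A sketch, assuming pdftotext is on the PATH and that "minn.pdf" (a hypothetical filename for the asker's "minn" document) sits in the working directory:

```r
library(tm)

# readPDF() returns a reader function; "-layout" tells pdftotext
# to preserve the page layout during extraction
pdfReader <- readPDF(control = list(text = "-layout"))

# point URISource at your own file instead of the tm vignette
corp <- Corpus(URISource("minn.pdf"),
               readerControl = list(reader = pdfReader))

inspect(corp[[1]])
```

The key change versus the vignette example is simply the URI: swap vignette("tm")$pdf for the path to your own PDF.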

A lemmatizing function using a hash dictionary does not work with tm package in R

孤街醉人 submitted on 2019-12-04 20:18:40
I would like to lemmatize Polish text using a large external dictionary (in a format like the txt variable below). I am not lucky enough to have a Polish option in the popular text-mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.) Unfortunately, it does not work with the corpus format generated by tm. Let me paste Dmitriy's code:

    library(hashmap)
    library(data.table)
    txt = "Abadan Abadanem Abadan
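The vector-based lemmatizer can be made tm-compatible by expressing the dictionary as a lookup table (here a plain named character vector rather than the hashmap package, which has since been archived from CRAN) and wrapping the per-document function in content_transformer(). A sketch with a toy two-entry fragment standing in for the large external dictionary:

```r
library(tm)

# toy fragment of the external dictionary: inflected form -> lemma
lemma_dict <- c(abadanem = "abadan", abadanowi = "abadan")

lemmatize <- function(doc) {
  words  <- unlist(strsplit(tolower(as.character(doc)), "\\s+"))
  mapped <- lemma_dict[words]                    # NA where the word is unknown
  mapped[is.na(mapped)] <- words[is.na(mapped)]  # keep unknown words as-is
  paste(mapped, collapse = " ")
}

corpus <- Corpus(VectorSource(c("Abadanem tekst")))
corpus <- tm_map(corpus, content_transformer(lemmatize))

as.character(corpus[[1]])  # "abadan tekst"
```

For a dictionary with millions of entries, named-vector lookup stays fast because R hashes character subscripts; the same structure can also be built with data.table if the dictionary is read from the txt file.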