tm

tm package error “Cannot convert DocumentTermMatrix into normal matrix since vector is too large”

北城以北 submitted on 2019-12-05 11:16:21
I have created a DocumentTermMatrix with 1859 documents (rows) and 25722 terms (columns). To perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command, but it fails with the error "cannot allocate vector of size 364.8 MB":

    > corp
    A corpus with 1859 text documents
    > mat <- DocumentTermMatrix(corp)
    > dim(mat)
    [1]  1859 25722
    > is(mat)
    [1] "DocumentTermMatrix"
    > mat2 <- as.matrix(mat)
    Fehler: kann Vektor der Größe 364.8 MB nicht allozieren  # cannot allocate vector of size 364.8 MB
    > object.size(mat)
    5502000 bytes
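One common workaround (not shown in the question itself) is to avoid densifying the matrix at all: a DocumentTermMatrix is stored as a slam simple triplet matrix, and its i/j/v slots can be fed straight into Matrix::sparseMatrix. A minimal sketch, using a toy corpus standing in for the asker's 1859-document one:

```r
library(tm)
library(Matrix)

# toy corpus standing in for the asker's 1859-document corpus
corp <- VCorpus(VectorSource(c("one two two", "two three")))
dtm  <- DocumentTermMatrix(corp)

# a DocumentTermMatrix is a simple_triplet_matrix: convert its
# triplet slots (i, j, v) directly into a sparse Matrix object,
# skipping the dense 1859 x 25722 allocation that as.matrix() needs
sparse <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                       dims = c(dtm$nrow, dtm$ncol),
                       dimnames = dimnames(dtm))

# most linear-algebra operations work on the sparse form directly
rowSums(sparse)
```

Whether this helps depends on what "further calculations" are needed; anything that genuinely requires a dense matrix will still hit the same memory wall.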

Support Vector Machine works on Training-set but not on Test-set in R (using e1071)

ぃ、小莉子 submitted on 2019-12-05 07:00:13
Question: I'm using a support vector machine for my document classification task. It classifies all the articles in my training set, but fails to classify the ones in my test set. trainDTM is the document-term matrix of my training set; testDTM is the one for the test set. Here is my (not so beautiful) code:

    # create data.frame with labelled sentences
    labeled <- as.data.frame(read.xlsx("C:\\Users\\LABELED.xlsx", 1, header=T))

    # create training set and test set
    traindata <- as.data.frame(labeled[1:700,c(
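A frequent cause of this symptom (a guess, since the question's code is truncated) is that the test DTM is built with its own vocabulary, so its columns do not line up with the features the SVM was trained on. tm can build the test matrix over the training vocabulary via the dictionary control option; a sketch with toy corpora standing in for the asker's data:

```r
library(tm)

# toy corpora standing in for the asker's training and test sets
trainCorpus <- VCorpus(VectorSource(c("good movie", "bad plot")))
testCorpus  <- VCorpus(VectorSource(c("good plot twist")))

trainDTM <- DocumentTermMatrix(trainCorpus)

# build the test matrix over the *training* vocabulary so its
# columns match the features the SVM saw during training
testDTM <- DocumentTermMatrix(testCorpus,
                              control = list(dictionary = Terms(trainDTM)))

# the two matrices now share the same term columns
identical(Terms(trainDTM), Terms(testDTM))
```

Terms present only in the test set are dropped, and training-only terms appear as zero columns, which is exactly what a fitted classifier expects.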

Snowball Stemmer only stems last word

最后都变了- submitted on 2019-12-05 06:05:08
I want to stem the documents in a Corpus of plain-text documents using the tm package in R. When I apply the SnowballStemmer function to all documents of the corpus, only the last word of each document is stemmed.

    library(tm)
    library(Snowball)
    library(RWeka)
    library(rJava)
    path <- c("C:/path/to/diretory")
    corp <- Corpus(DirSource(path),
                   readerControl = list(reader = readPlain, language = "en_US", load = TRUE))
    tm_map(corp, SnowballStemmer)  # stemDocument has the same problem

I think it is related to the way the documents are read into the corpus. To illustrate this with some simple examples:
    >
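This behaviour typically happens when the stemmer receives the whole document as one string and stems only its final token. One workaround (a sketch, not the thread's accepted answer) is to split each document into words, stem them individually with SnowballC::wordStem, and paste them back together:

```r
library(tm)
library(SnowballC)  # current replacement for the old Snowball package

# stem every word of a document, not just the last one
stemEachWord <- function(doc) {
  words <- unlist(strsplit(as.character(doc), "\\s+"))
  paste(wordStem(words, language = "en"), collapse = " ")
}

corp <- VCorpus(VectorSource(c("running runners ran")))
corp <- tm_map(corp, content_transformer(stemEachWord))

as.character(corp[[1]])
```

In current versions of tm, `tm_map(corp, stemDocument)` already stems word by word, so the manual split is only needed with the old Snowball-based setup shown in the question.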

Remove meaningless words from corpus in R

你说的曾经没有我的故事 submitted on 2019-12-05 02:38:53
Question: I am using tm and wordcloud to perform some basic text mining in R. The text being processed contains many meaningless words like asfdg and aawptkr, and I need to filter such words out. The closest solution I have found is to use library(qdapDictionaries) and build a custom function to check the validity of words:

    library(qdapDictionaries)
    is.word <- function(x) x %in% GradyAugmented
    # example
    > is.word("aapg")
    [1] FALSE

The rest of the text-mining code is:

    curDir <- "E:/folder1/"  # folder1
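The is.word() check above can be lifted into a corpus transformation: tokenize each document, keep only the tokens found in GradyAugmented, and rejoin. A sketch, assuming the corpus text is already lower-cased (GradyAugmented is a lower-case word list):

```r
library(tm)
library(qdapDictionaries)

# drop every token that is not in the GradyAugmented word list
keepRealWords <- function(doc) {
  words <- unlist(strsplit(as.character(doc), "\\s+"))
  paste(words[words %in% GradyAugmented], collapse = " ")
}

corpus <- VCorpus(VectorSource(c("the asfdg cat aawptkr sat")))
corpus <- tm_map(corpus, content_transformer(keepRealWords))

as.character(corpus[[1]])  # "the cat sat"
```

Note that any legitimate term absent from the dictionary (names, domain jargon) will also be removed, so this filter is best applied after inspecting what it discards.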

tm loses the metadata when applying tm_map

回眸只為那壹抹淺笑 submitted on 2019-12-05 02:24:35
Question: I have a (small) problem with the tm R library. Say I have a corpus:

    # boilerplate
    bcorp <- c("one","two","three","four","five")
    myCorpus <- Corpus(VectorSource(bcorp), list(lanuage = "en_US"))
    tdm <- TermDocumentMatrix(myCorpus)
    Docs(tdm)

Result:

    [1] "1" "2" "3" "4" "5"

This works. But when I try to use a transformation with tm_map():

    # this does not work
    myCorpus <- Corpus(VectorSource(bcorp), list(lanuage = "en_US"))
    myCorpus <- tm_map(myCorpus, tolower)
    tdm <- TermDocumentMatrix(myCorpus)
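The usual cause (hedged, since the question is cut off) is that passing a base function like tolower straight to tm_map replaces each document with a bare character vector, dropping the metadata tm uses for document IDs. Wrapping the function in content_transformer() applies it to the content only and keeps the corpus structure intact:

```r
library(tm)

bcorp <- c("one", "two", "three", "four", "five")
myCorpus <- Corpus(VectorSource(bcorp))

# content_transformer() applies tolower to the text content only,
# leaving each document's metadata (including its ID) intact
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)  # "1" "2" "3" "4" "5", as before the transformation
```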

Removing stopwords from a user-defined corpus in R

你说的曾经没有我的故事 submitted on 2019-12-05 02:03:51
Question: I have a set of documents:

    documents = c("She had toast for breakfast",
                  "The coffee this morning was excellent",
                  "For lunch let's all have pancakes",
                  "Later in the day, there will be more talks",
                  "The talks on the first day were great",
                  "The second day should have good presentations too")

In this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using:

    documents = tolower(documents)  # make it lower case
    documents = gsub('[[
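For a plain character vector like this one, tm's removeWords() can strip the standard English stopword list directly, with no corpus object required. A sketch with the first two documents:

```r
library(tm)

documents <- c("she had toast for breakfast",
               "the coffee this morning was excellent")

# removeWords also works on plain character vectors,
# not only on corpus objects
documents <- removeWords(documents, stopwords("en"))

# collapse the whitespace runs left behind by removed words
documents <- gsub("\\s+", " ", trimws(documents))
documents
```

stopwords("en") is a fixed list; domain-specific stopwords can be appended to it with c() before the call.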

FUN-error after running 'tolower' while making Twitter wordcloud

Deadly submitted on 2019-12-05 00:53:07
Question: I am trying to create a word cloud from Twitter data, but I get the following error:

    Error in FUN(X[[72L]], ...) : invalid input '������������❤������������ "@xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs'

This error appears after running the line mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower) in this code:

    mytwittersearch_list <- sapply(mytwittersearch, function(x) x$getText())
    mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_corpus_list))
    mytwittersearch_corpus <- tm_map
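A common fix (hedged, since the rest of the thread is not shown) is to strip the byte sequences that are not valid text with iconv() before lower-casing, and to wrap tolower in content_transformer so the corpus structure survives:

```r
library(tm)

# drop bytes that do not survive conversion to plain ASCII
# (emoji fragments and other non-UTF-8 debris from tweets)
cleanEncoding <- function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")

tweets <- c("\u2764 @xxx: bla, bla, bla... http://t.co/56Fb78aTSC")
corpus <- Corpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(cleanEncoding))
corpus <- tm_map(corpus, content_transformer(tolower))

as.character(corpus[[1]])
```

Using sub = "" silently deletes the offending characters; sub = "byte" keeps them as hex escapes if they need to be inspected instead.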

tm custom removePunctuation except hashtag

北城以北 submitted on 2019-12-05 00:47:50
Question: I have a corpus of tweets from Twitter. I clean this corpus (removeWords, tolower, delete URLs) and finally also want to remove punctuation. Here is my code:

    tweetCorpus <- tm_map(tweetCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)

The problem is that by doing so I also lose the hashtag symbol (#). Is there a way to remove punctuation with tm_map but keep the hashtags?

Answer 1: You could adapt the existing removePunctuation to suit your needs. For example:

    removeMostPunctuation <-
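The answer's code is cut off above; one simple version of the same idea (a sketch, not necessarily the answer's exact implementation) is a regex that deletes everything except letters, digits, whitespace, and the characters to keep:

```r
library(tm)

# remove punctuation but keep '#' for hashtags
# (and '-' to preserve intra-word dashes, as in the original call)
removeMostPunctuation <- content_transformer(function(x) {
  gsub("[^[:alnum:][:space:]#-]", "", x)
})

tweetCorpus <- Corpus(VectorSource(c("wow!! #rstats is well-liked, right?")))
tweetCorpus <- tm_map(tweetCorpus, removeMostPunctuation)

as.character(tweetCorpus[[1]])  # "wow #rstats is well-liked right"
```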

Importing pdf in R through package “tm”

折月煮酒 submitted on 2019-12-04 20:41:05
I know the practical example for loading a PDF into the R workspace through the package "tm", but I am not able to understand how the code works, and therefore cannot import the PDF I actually want. The PDF imported in the following code is the "tm" vignette:

    if(file.exists(Sys.which("pdftotext"))) {
      pdf <- readPDF(PdftotextOptions = "-layout")(elem = list(uri = vignette("tm")$pdf),
                     language = "en", id = "id1")
      pdf[1:13]
    }

The PDF I am trying to load is a different one. How do I change the above code to bring my own PDF into the workspace? minn is the PDF document which I am trying to
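In current versions of tm, the pdftotext options are passed via a control list rather than the old PdftotextOptions argument, and the reader returned by readPDF() is handed to Corpus() directly. A sketch, assuming pdftotext is on the PATH and that "minn.pdf" (a hypothetical filename for the asker's "minn" document) sits in the working directory:

```r
library(tm)

# readPDF() returns a reader function; "-layout" tells pdftotext
# to preserve the page layout during extraction
pdfReader <- readPDF(control = list(text = "-layout"))

# point URISource at your own file instead of the tm vignette
corp <- Corpus(URISource("minn.pdf"),
               readerControl = list(reader = pdfReader))

inspect(corp[[1]])
```

The key change versus the vignette example is simply the URI: swap vignette("tm")$pdf for the path to your own PDF.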

A lemmatizing function using a hash dictionary does not work with tm package in R

孤街醉人 submitted on 2019-12-04 20:18:40
I would like to lemmatize Polish text using a large external dictionary (in a format like the txt variable below). I am not lucky enough to have a Polish option in the popular text-mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.) Unfortunately, it does not work with the corpus format generated by tm. Let me paste Dmitriy's code:

    library(hashmap)
    library(data.table)
    txt = "Abadan Abadanem Abadan
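The vector-based lemmatizer can be made tm-compatible by expressing the dictionary as a lookup table (here a plain named character vector rather than the hashmap package, which has since been archived from CRAN) and wrapping the per-document function in content_transformer(). A sketch with a toy two-entry fragment standing in for the large external dictionary:

```r
library(tm)

# toy fragment of the external dictionary: inflected form -> lemma
lemma_dict <- c(abadanem = "abadan", abadanowi = "abadan")

lemmatize <- function(doc) {
  words  <- unlist(strsplit(tolower(as.character(doc)), "\\s+"))
  mapped <- lemma_dict[words]                    # NA where the word is unknown
  mapped[is.na(mapped)] <- words[is.na(mapped)]  # keep unknown words as-is
  paste(mapped, collapse = " ")
}

corpus <- Corpus(VectorSource(c("Abadanem tekst")))
corpus <- tm_map(corpus, content_transformer(lemmatize))

as.character(corpus[[1]])  # "abadan tekst"
```

For a dictionary with millions of entries, named-vector lookup stays fast because R hashes character subscripts; the same structure can also be built with data.table if the dictionary is read from the txt file.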