tm Package error: Error defining Document Term Matrix

心已入冬 submitted on 2019-12-21 21:19:20
Question: I am analyzing the Reuters-21578 corpus, all the Reuters news articles from 1987, using the tm package. After importing the XML files into an R data file, I clean the text (convert to plain text, convert to lower case, remove stop words, etc., as seen below), then try to convert the corpus to a document-term matrix, but I receive an error message:

Error in UseMethod("Content", x) :
  no applicable method for 'Content' applied to an object of class "character"

All pre-processing steps work…
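This error typically appears when a plain function such as tolower is mapped over the corpus and silently converts documents into bare character vectors. In current tm versions the usual fix is to wrap base functions in content_transformer() so the document class is preserved. A minimal sketch, not the asker's actual pipeline:

```r
library(tm)

docs <- VCorpus(VectorSource(c("First DOCUMENT.", "Second document!")))

# content_transformer() wraps a plain function so tm_map keeps the
# PlainTextDocument class instead of degrading documents to character
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("en"))

dtm <- DocumentTermMatrix(docs)
inspect(dtm)
```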

Use tm's Corpus function with big data in R

百般思念 submitted on 2019-12-21 04:48:06
Question: I'm trying to do text mining on big data in R with tm. I frequently run into memory issues (such as "cannot allocate vector of size ...") and have used the established methods of troubleshooting them, such as:

- using 64-bit R
- trying different OSs (Windows, Linux, Solaris, etc.)
- setting memory.limit() to its maximum
- making sure that sufficient RAM and compute are available on the server (which there are)
- making liberal use of gc()
- profiling the code for bottlenecks
- breaking up big operations…
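One approach not on that list is chunking: build a document-term matrix per slice of the data so no single allocation has to hold the whole corpus, then combine the sparse results. A sketch under the assumption that tm's c() method for term matrices is available to merge them (the data here is a small stand-in):

```r
library(tm)

big_docs <- rep(c("some example text", "more example text here"), 50)  # stand-in data

# Build one DTM per chunk of 25 documents, then combine the sparse matrices;
# intermediate chunks can be garbage-collected instead of living all at once
chunks <- split(big_docs, ceiling(seq_along(big_docs) / 25))
dtms   <- lapply(chunks, function(x) DocumentTermMatrix(VCorpus(VectorSource(x))))
dtm    <- do.call(c, dtms)
rm(dtms); gc()
dim(dtm)
```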

Does tm automatically ignore the very short strings?

时光毁灭记忆、已成空白 submitted on 2019-12-20 07:26:42
Question: Here is my code.

Example 1:

a <- c("ab cd de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE))
inspect(a2)

The result is:

         Docs
Terms     1 2
  12v     0 1
  a23     0 1
  alkalin 0 1
  batteri 0 1
  energ   0 1

It looks like the first string in a is ignored.

Example 2:

a <- c("abcd cde de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE))
inspect(a2)

The result is: …
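The short strings are not dropped as documents; tm's term-frequency defaults discard tokens shorter than three characters (wordLengths = c(3, Inf)), so "ab", "cd" and "de" never become terms. Overriding wordLengths in the control list keeps them:

```r
library(tm)

a  <- c("ab cd de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))

# wordLengths = c(1, Inf) overrides the default minimum token length of 3,
# so the two-character tokens from the first string are counted too
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE, wordLengths = c(1, Inf)))
inspect(a2)
```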

Replace words in corpus according to dictionary data frame

可紊 submitted on 2019-12-19 04:56:31
Question: I am interested in replacing all words in a tm Corpus object according to a dictionary stored in a two-column data frame, where the first column is the word to be matched and the second column is the replacement word. I am stuck with the translate function. I saw this answer but I can't turn it into a function that can be passed to tm_map. Please consider the following MWE:

library(tm)
docs <- c("first text", "second text")
corp <- Corpus(VectorSource(docs))
dictionary <- data.frame(word = c(…
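One way to sidestep translate entirely is a plain gsub() loop over the dictionary rows, wrapped with content_transformer() so it can be passed to tm_map. A base-R sketch; the dictionary values below are made up for illustration:

```r
# Replace whole words according to a two-column dictionary data frame
dictionary <- data.frame(word        = c("first", "second"),
                         replacement = c("1st", "2nd"),
                         stringsAsFactors = FALSE)

replace_words <- function(x, dict) {
  for (i in seq_len(nrow(dict))) {
    # \\b anchors keep "first" from matching inside e.g. "firstly"
    x <- gsub(paste0("\\b", dict$word[i], "\\b"), dict$replacement[i], x)
  }
  x
}

replace_words(c("first text", "second text"), dictionary)
# "1st text" "2nd text"
```

With tm, the same function can be applied as: corp <- tm_map(corp, content_transformer(replace_words), dictionary)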

How do I create a corpus of *.docx files with tm?

余生颓废 submitted on 2019-12-19 04:14:56
Question: I have a mixed-filetype collection of MS Word documents. Some files are *.doc and some are *.docx. I'm learning to use tm and I've (more or less*) successfully created a corpus composed of the *.doc files using this:

ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'),
                 readerControl = list(reader = readDOC, language = 'en_CA', load = TRUE));

This command does not handle *.docx files. I assume that I need a different reader. From this article, I understand that I could write my own (given a…
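Since a .docx file is just a zip archive whose body text lives in word/document.xml, one stopgap that avoids writing a full tm reader is to extract that XML and strip the tags before building the corpus. read_docx_text below is a hypothetical helper, and the tag stripping is deliberately crude:

```r
# Hypothetical helper: pull plain text out of a .docx (a zip of XML parts)
read_docx_text <- function(path) {
  tmp <- tempfile()
  unzip(path, files = "word/document.xml", exdir = tmp)
  xml <- readLines(file.path(tmp, "word/document.xml"), warn = FALSE)
  gsub("<[^>]+>", " ", paste(xml, collapse = " "))  # crude: drop all XML tags
}

# files <- list.files('~/R/expertise/corpus/english', pattern = "\\.docx$", full.names = TRUE)
# texts <- vapply(files, read_docx_text, character(1))
# corp  <- Corpus(VectorSource(texts))
```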

R remove stopwords from a character vector using %in%

大憨熊 submitted on 2019-12-19 03:44:23
Question: I have a data frame with strings from which I'd like to remove stop words. I'm trying to avoid the tm package, as it's a large data set and tm seems to run a bit slowly. I am using the tm stopword dictionary.

library(plyr)
library(tm)
stopWords <- stopwords("en")
class(stopWords)
df1 <- data.frame(id = seq(1, 5, 1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even…
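The %in% approach itself needs no tm machinery at all; only the stopword list comes from tm. A base-R sketch of the per-string filter (the stop list here is a short stand-in for tm::stopwords("en")):

```r
# Split each string, drop tokens found in the stop list, re-join
stop_words <- c("this", "is", "a", "an")   # stand-in for tm::stopwords("en")

remove_stopwords <- function(x, stops) {
  words <- strsplit(tolower(x), "\\s+")[[1]]
  paste(words[!words %in% stops], collapse = " ")
}

remove_stopwords("This string is a string.", stop_words)
# "string string."
```

Note that punctuation stays attached to tokens ("string." is not the same token as "string"), so for exact matching against a stop list you may want to strip punctuation first.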

R, tm-error of transformation drops documents

陌路散爱 submitted on 2019-12-18 23:02:29
Question: I want to create a network based on the weight of keywords from text, but I get warnings when running the code involving tm_map:

library(tm)
library(NLP)
library(openNLP)
text <- c('.......')
corp <- Corpus(VectorSource(text))
corp <- tm_map(corp, stripWhitespace)

Warning message:
In tm_map.SimpleCorpus(corp, stripWhitespace) : transformation drops documents

corp <- tm_map(corp, tolower)

Warning message:
In tm_map.SimpleCorpus(corp, tolower) : transformation drops documents

The code was…
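With a SimpleCorpus (what Corpus() often builds in recent tm versions), "transformation drops documents" is a warning rather than an error, and the usual workarounds are to build a VCorpus explicitly and to wrap base functions such as tolower in content_transformer(). A sketch with placeholder text:

```r
library(tm)

text <- c("Some KEYWORD   text", "Another keyword sentence")

corp <- VCorpus(VectorSource(text))                 # VCorpus avoids the SimpleCorpus path
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, content_transformer(tolower))  # not plain tolower

sapply(corp, content)
```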

Removing non-English text from Corpus in R using tm()

会有一股神秘感。 submitted on 2019-12-18 11:32:39
Question: I am using tm() and wordcloud() for some basic data mining in R, but I am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables). Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:

Special satisfação Happy Sad Potential für

I then read my txt file into R:

words <- Corpus(DirSource("~/temp", encoding = "UTF-8"), readerControl = list(language =…
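If the goal is simply to drop any token containing a non-ASCII character before the corpus is built, a base-R filter is enough. A sketch (the sample words use Unicode escapes so the snippet is encoding-safe):

```r
words <- c("Special", "satisfa\u00e7\u00e3o", "Happy", "Sad", "Potential", "f\u00fcr")

# Keep only tokens made entirely of printable ASCII characters
# (the character class " -~" spans space through tilde in ASCII)
ascii_only <- words[!grepl("[^ -~]", words)]
ascii_only
# "Special" "Happy" "Sad" "Potential"
```

An alternative is iconv(words, "UTF-8", "ASCII", sub = ""), which strips the offending characters instead of dropping whole tokens.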

Use R to convert PDF files to text files for text mining

廉价感情. submitted on 2019-12-18 10:24:31
Question: I have nearly one thousand PDF journal articles in a folder, and I need to text-mine the abstracts of all the articles in it. Right now I am doing the following:

dest <- "~/A1.pdf"

# set path to pdftotext.exe and convert pdf to text
exe <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = F)

# get txt-file name and open it
filetxt <- sub(".pdf", ".txt", dest)
shell.exec(filetxt)

By this, I am converting one PDF file to…
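Looping that single-file conversion over the whole folder is mostly a matter of list.files() plus the same system() call. A sketch: the exe path is the Windows-specific one from the question, and "~/articles" is a made-up folder name standing in for the real one:

```r
exe  <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
pdfs <- list.files("~/articles", pattern = "\\.pdf$", full.names = TRUE)

for (dest in pdfs) {
  # quote both paths so spaces (e.g. "Program Files") survive the shell
  system(paste0("\"", exe, "\" \"", dest, "\""), wait = TRUE)
}

# the converted text files sit next to the PDFs and can feed a corpus
txts <- sub("\\.pdf$", ".txt", pdfs)
# corp <- Corpus(URISource(txts), readerControl = list(reader = readPlain))
```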