tm Package error: Error defining Document Term Matrix

心已入冬 submitted on 2019-12-21 21:19:20
Question: I am analyzing the Reuters-21578 corpus, all the Reuters news articles from 1987, using the tm package. After importing the XML files into an R data file, I clean the text (convert to plain text, convert to lower case, remove stop words, etc., as seen below), then try to convert the corpus to a document-term matrix, but I receive an error message:

Error in UseMethod("Content", x) :
  no applicable method for 'Content' applied to an object of class "character"

All pre-processing steps work…
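This error typically appears when a plain function such as tolower is mapped over the corpus and silently converts documents into bare character vectors. In current tm versions the usual fix is to wrap base functions in content_transformer() so the document class is preserved. A minimal sketch, not the asker's actual pipeline:

```r
library(tm)

docs <- VCorpus(VectorSource(c("First DOCUMENT.", "Second document!")))

# content_transformer() wraps a plain function so tm_map keeps the
# PlainTextDocument class instead of degrading documents to character
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("en"))

dtm <- DocumentTermMatrix(docs)
inspect(dtm)
```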

Use tm's Corpus function with big data in R

百般思念 submitted on 2019-12-21 04:48:06
Question: I'm trying to do text mining on big data in R with tm. I frequently run into memory issues (such as "cannot allocate vector of size ...") and have used the established methods of troubleshooting them, such as:

- using 64-bit R
- trying different OSs (Windows, Linux, Solaris, etc.)
- setting memory.limit() to its maximum
- making sure that sufficient RAM and compute are available on the server (which there are)
- making liberal use of gc()
- profiling the code for bottlenecks
- breaking up big operations…
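One approach not on that list is chunking: build a document-term matrix per slice of the data so no single allocation has to hold the whole corpus, then combine the sparse results. A sketch under the assumption that tm's c() method for term matrices is available to merge them (the data here is a small stand-in):

```r
library(tm)

big_docs <- rep(c("some example text", "more example text here"), 50)  # stand-in data

# Build one DTM per chunk of 25 documents, then combine the sparse matrices;
# intermediate chunks can be garbage-collected instead of living all at once
chunks <- split(big_docs, ceiling(seq_along(big_docs) / 25))
dtms   <- lapply(chunks, function(x) DocumentTermMatrix(VCorpus(VectorSource(x))))
dtm    <- do.call(c, dtms)
rm(dtms); gc()
dim(dtm)
```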

Does tm automatically ignore the very short strings?

时光毁灭记忆、已成空白 submitted on 2019-12-20 07:26:42
Question: Here is my code.

Example 1:

a <- c("ab cd de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE))
inspect(a2)

The result is:

         Docs
Terms     1 2
  12v     0 1
  a23     0 1
  alkalin 0 1
  batteri 0 1
  energ   0 1

It looks like the first string in a is ignored.

Example 2:

a <- c("abcd cde de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE))
inspect(a2)

The result is: …
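The short strings are not dropped as documents; tm's term-frequency defaults discard tokens shorter than three characters (wordLengths = c(3, Inf)), so "ab", "cd" and "de" never become terms. Overriding wordLengths in the control list keeps them:

```r
library(tm)

a  <- c("ab cd de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))

# wordLengths = c(1, Inf) overrides the default minimum token length of 3,
# so the two-character tokens from the first string are counted too
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE, wordLengths = c(1, Inf)))
inspect(a2)
```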

Replace words in corpus according to dictionary data frame

可紊 submitted on 2019-12-19 04:56:31
Question: I am interested in replacing all words in a tm Corpus object according to a dictionary stored in a two-column data frame, where the first column is the word to be matched and the second column is the replacement word. I am stuck with the translate function. I saw this answer but I can't turn it into a function that can be passed to tm_map. Please consider the following MWE:

library(tm)
docs <- c("first text", "second text")
corp <- Corpus(VectorSource(docs))
dictionary <- data.frame(word = c(…
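One way to sidestep translate entirely is a plain gsub() loop over the dictionary rows, wrapped with content_transformer() so it can be passed to tm_map. A base-R sketch; the dictionary values below are made up for illustration:

```r
# Replace whole words according to a two-column dictionary data frame
dictionary <- data.frame(word        = c("first", "second"),
                         replacement = c("1st", "2nd"),
                         stringsAsFactors = FALSE)

replace_words <- function(x, dict) {
  for (i in seq_len(nrow(dict))) {
    # \\b anchors keep "first" from matching inside e.g. "firstly"
    x <- gsub(paste0("\\b", dict$word[i], "\\b"), dict$replacement[i], x)
  }
  x
}

replace_words(c("first text", "second text"), dictionary)
# "1st text" "2nd text"
```

With tm, the same function can be applied as: corp <- tm_map(corp, content_transformer(replace_words), dictionary)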

How do I create a corpus of *.docx files with tm?

余生颓废 submitted on 2019-12-19 04:14:56
Question: I have a mixed-filetype collection of MS Word documents. Some files are *.doc and some are *.docx. I'm learning to use tm and I've (more or less*) successfully created a corpus composed of the *.doc files using this:

ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'),
                 readerControl = list(reader = readDOC, language = 'en_CA', load = TRUE));

This command does not handle *.docx files. I assume that I need a different reader. From this article, I understand that I could write my own (given a…
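Since a .docx file is just a zip archive whose body text lives in word/document.xml, one stopgap that avoids writing a full tm reader is to extract that XML and strip the tags before building the corpus. read_docx_text below is a hypothetical helper, and the tag stripping is deliberately crude:

```r
# Hypothetical helper: pull plain text out of a .docx (a zip of XML parts)
read_docx_text <- function(path) {
  tmp <- tempfile()
  unzip(path, files = "word/document.xml", exdir = tmp)
  xml <- readLines(file.path(tmp, "word/document.xml"), warn = FALSE)
  gsub("<[^>]+>", " ", paste(xml, collapse = " "))  # crude: drop all XML tags
}

# files <- list.files('~/R/expertise/corpus/english', pattern = "\\.docx$", full.names = TRUE)
# texts <- vapply(files, read_docx_text, character(1))
# corp  <- Corpus(VectorSource(texts))
```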

R remove stopwords from a character vector using %in%

大憨熊 submitted on 2019-12-19 03:44:23
Question: I have a data frame with strings from which I'd like to remove stop words. I'm trying to avoid the tm package, as it's a large data set and tm seems to run a bit slowly. I am using the tm stopword dictionary.

library(plyr)
library(tm)
stopWords <- stopwords("en")
class(stopWords)
df1 <- data.frame(id = seq(1, 5, 1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even…
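The %in% approach itself needs no tm machinery at all; only the stopword list comes from tm. A base-R sketch of the per-string filter (the stop list here is a short stand-in for tm::stopwords("en")):

```r
# Split each string, drop tokens found in the stop list, re-join
stop_words <- c("this", "is", "a", "an")   # stand-in for tm::stopwords("en")

remove_stopwords <- function(x, stops) {
  words <- strsplit(tolower(x), "\\s+")[[1]]
  paste(words[!words %in% stops], collapse = " ")
}

remove_stopwords("This string is a string.", stop_words)
# "string string."
```

Note that punctuation stays attached to tokens ("string." is not the same token as "string"), so for exact matching against a stop list you may want to strip punctuation first.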

R, tm-error of transformation drops documents

陌路散爱 submitted on 2019-12-18 23:02:29
Question: I want to create a network based on the weight of keywords from text, but I get warnings when running the code involving tm_map:

library(tm)
library(NLP)
library(openNLP)
text <- c('.......')
corp <- Corpus(VectorSource(text))
corp <- tm_map(corp, stripWhitespace)

Warning message:
In tm_map.SimpleCorpus(corp, stripWhitespace) : transformation drops documents

corp <- tm_map(corp, tolower)

Warning message:
In tm_map.SimpleCorpus(corp, tolower) : transformation drops documents

The code was…
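With a SimpleCorpus (what Corpus() often builds in recent tm versions), "transformation drops documents" is a warning rather than an error, and the usual workarounds are to build a VCorpus explicitly and to wrap base functions such as tolower in content_transformer(). A sketch with placeholder text:

```r
library(tm)

text <- c("Some KEYWORD   text", "Another keyword sentence")

corp <- VCorpus(VectorSource(text))                 # VCorpus avoids the SimpleCorpus path
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, content_transformer(tolower))  # not plain tolower

sapply(corp, content)
```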

Removing non-English text from Corpus in R using tm()

会有一股神秘感。 submitted on 2019-12-18 11:32:39
Question: I am using tm() and wordcloud() for some basic data mining in R, but I am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables). Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:

Special satisfação Happy Sad Potential für

I then read my txt file into R:

words <- Corpus(DirSource("~/temp", encoding = "UTF-8"), readerControl = list(language =…
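If the goal is simply to drop any token containing a non-ASCII character before the corpus is built, a base-R filter is enough. A sketch (the sample words use Unicode escapes so the snippet is encoding-safe):

```r
words <- c("Special", "satisfa\u00e7\u00e3o", "Happy", "Sad", "Potential", "f\u00fcr")

# Keep only tokens made entirely of printable ASCII characters
# (the character class " -~" spans space through tilde in ASCII)
ascii_only <- words[!grepl("[^ -~]", words)]
ascii_only
# "Special" "Happy" "Sad" "Potential"
```

An alternative is iconv(words, "UTF-8", "ASCII", sub = ""), which strips the offending characters instead of dropping whole tokens.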

Use R to convert PDF files to text files for text mining

廉价感情. submitted on 2019-12-18 10:24:31
Question: I have nearly one thousand PDF journal articles in a folder, and I need to text-mine the abstracts of all the articles in it. Right now I am doing the following:

dest <- "~/A1.pdf"

# set path to pdftotext.exe and convert pdf to text
exe <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = F)

# get txt-file name and open it
filetxt <- sub(".pdf", ".txt", dest)
shell.exec(filetxt)

By this, I am converting one PDF file to…
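Looping that single-file conversion over the whole folder is mostly a matter of list.files() plus the same system() call. A sketch: the exe path is the Windows-specific one from the question, and "~/articles" is a made-up folder name standing in for the real one:

```r
exe  <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
pdfs <- list.files("~/articles", pattern = "\\.pdf$", full.names = TRUE)

for (dest in pdfs) {
  # quote both paths so spaces (e.g. "Program Files") survive the shell
  system(paste0("\"", exe, "\" \"", dest, "\""), wait = TRUE)
}

# the converted text files sit next to the PDFs and can feed a corpus
txts <- sub("\\.pdf$", ".txt", pdfs)
# corp <- Corpus(URISource(txts), readerControl = list(reader = readPlain))
```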