
Why isn't stemDocument stemming?

走远了吗 · Submitted on 2019-12-06 16:19:40
I am using the 'tm' package in R to create a term-document matrix with stemmed terms. The process completes, but the resulting matrix includes terms that don't appear to have been stemmed, and I'm trying to understand why that is and how to fix it. Here is the script for the process, which uses a couple of online news stories as the sandbox:

    library(boilerpipeR)
    library(RCurl)
    library(tm)
    # Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
    url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
    extract <-
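For reference, a minimal sketch of the usual stemming pipeline (the example documents here are made up, not the asker's news stories): stemDocument must be applied through tm_map before the matrix is built, and it relies on the SnowballC package being installed.

```r
# Sketch with hypothetical documents. stemDocument uses the SnowballC stemmer.
library(tm)

docs   <- c("Google disables Flash", "Mozilla disabled Flash plugins")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)   # "disables"/"disabled" both become "disabl"

tdm <- TermDocumentMatrix(corpus)
Terms(tdm)
```

If terms still look unstemmed, a common cause is building the matrix from the original, unstemmed corpus object rather than from the one returned by tm_map.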

Remove stopwords and tolower function slow on a Corpus in R

无人久伴 · Submitted on 2019-12-06 15:02:11
I have a corpus with roughly 75 MB of data. I am trying to use the following commands:

    tm_map(doc.corpus, removeWords, stopwords("english"))
    tm_map(doc.corpus, tolower)

These two functions alone take at least 40 minutes to run. I am looking to speed up the process, as I use the TDM matrix for my model. I have tried commands like gc() and memory.limit(10000000) very frequently, but I am not able to speed things up. I have a system with 4 GB RAM and run a local database to read the input data. Hoping for suggestions to speed up! Maybe you can give quanteda a try:

    library(stringi)
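As a sketch of the quanteda suggestion (the text vector here is a stand-in for the asker's 75 MB corpus): quanteda's tokens/dfm pipeline lowercases and removes stopwords in vectorised C++ code, which is typically much faster than tm_map on a large corpus.

```r
# Sketch: quanteda equivalent of tolower + removeWords + TDM construction.
library(quanteda)

txt  <- c("This is a small example document", "Another example text goes here")
toks <- tokens(txt)
toks <- tokens_remove(tokens_tolower(toks), stopwords("english"))
dtm  <- dfm(toks)   # sparse document-feature matrix, analogous to tm's DTM
```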

Importing pdf in R through package “tm”

柔情痞子 · Submitted on 2019-12-06 14:57:41
Question: I know the practical example for getting a PDF into the R workspace through the package "tm", but I am not able to understand how the code works, and thus not able to import the PDF I actually want. The PDF imported in the following code is the "tm" vignette. The code is:

    if(file.exists(Sys.which("pdftotext"))) {
      pdf <- readPDF(PdftotextOptions = "-layout")(elem = list(uri = vignette("tm")$pdf),
                     language = "en", id = "id1")
      pdf[1:13]
    }

That reads the "tm" vignette, while the PDF I am trying to bring in is different. So how
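For what it's worth, in current tm versions (0.6 and later) the pdftotext options are passed through a control list rather than a PdftotextOptions argument, and an arbitrary local file can be read like this ("my.pdf" is a placeholder path; the external pdftotext utility must be on the PATH):

```r
# Sketch: reading a local PDF with tm >= 0.6. readPDF() returns a reader
# function, which is then called with the file's URI.
library(tm)

read_pdf <- readPDF(control = list(text = "-layout"))
doc <- read_pdf(elem = list(uri = "my.pdf"), language = "en", id = "id1")
content(doc)[1:13]   # first lines of extracted text
```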

Turn Unicode into Umlaut in R on Mac (Facebook Data)

本小妞迷上赌 · Submitted on 2019-12-06 12:13:47
I did a lot of research on this and I still can't find a solution. I have extracted data from a German Facebook group that looks like this:

    from_ID   from_name   message                                      created_time
    12334543  Max Muster  Dies war auch eine sehr sch<U+00F6>ne Bucht  2016-01-08T19:00:54+0000

I understand that <U+00F6> stands for the German Umlaut ö. There are many other examples of Unicode escapes replacing German Umlaute or other language-specific characters (no matter which language). Whether I want to do a sentiment analysis or just produce a wordcloud, I sometimes have issues with this. In the case of the sentiment analysis an issue is
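One common base-R workaround (a sketch, and only for the literal `<U+XXXX>` markers shown above, not every encoding problem): find the escape strings and replace each with the character its hex code denotes.

```r
# Sketch: map literal "<U+....>" escape strings back to their characters.
fix_unicode <- function(x) {
  # find all distinct <U+XXXX> markers in the text
  codes <- unique(unlist(regmatches(x, gregexpr("<U\\+[0-9A-Fa-f]{4}>", x))))
  for (code in codes) {
    hex  <- substr(code, 4, 7)                 # the 4 hex digits
    char <- intToUtf8(strtoi(hex, base = 16L)) # e.g. 0x00F6 -> "ö"
    x <- gsub(code, char, x, fixed = TRUE)
  }
  x
}

fix_unicode("Dies war auch eine sehr sch<U+00F6>ne Bucht")
# "Dies war auch eine sehr schöne Bucht"
```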

How to keep the beginning and end of sentence markers with quanteda

a 夏天 · Submitted on 2019-12-06 12:08:25
Question: I'm trying to create 3-grams using R's quanteda package. I'm struggling to find a way to keep the beginning- and end-of-sentence markers, <s> and </s>, in the n-grams, as in the code below. I thought that using keptFeatures with a regular expression matching those should maintain them, but the chevron markers are always removed. How can I keep the chevron markers from being removed, or what is the best way to delimit the beginning and end of a sentence with quanteda? As a bonus question, what is
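A sketch of one way this is often handled (the sample sentence is made up): tokenize on whitespace only, so the chevrons are never stripped as punctuation, then build the 3-grams from those tokens.

```r
# Sketch: what = "fasterword" splits only on whitespace, so "<s>" and
# "</s>" survive as ordinary tokens before tokens_ngrams runs.
library(quanteda)

txt  <- "<s> the cat sat </s>"
toks <- tokens(txt, what = "fasterword")
tokens_ngrams(toks, n = 3)
# e.g. "<s>_the_cat", "the_cat_sat", "cat_sat_</s>"
```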

R Text mining - how to change texts in R data frame column into several columns with word frequencies?

牧云@^-^@ · Submitted on 2019-12-06 12:00:47
Question: I have a data frame with 4 columns. Column 1 consists of IDs, column 2 consists of texts (about 100 words each), and columns 3 and 4 contain labels. I would like to retrieve word frequencies (of the most common words) from the texts column and add those frequencies as extra columns to the data frame. I would like the column names to be the words themselves and the columns to be filled with their frequencies in the texts (ranging from 0 upward per text). I tried some functions of the tm package but
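A base-R sketch of the reshaping step (the toy data frame and column names are made up; a tm document-term matrix would give the same result for large data):

```r
# Sketch: turn a text column into one frequency column per word,
# ordered with the most common words first.
df <- data.frame(id   = 1:2,
                 text = c("apple apple pear", "pear plum"),
                 stringsAsFactors = FALSE)

words <- strsplit(tolower(df$text), "\\s+")
vocab <- names(sort(table(unlist(words)), decreasing = TRUE))

freqs <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
result <- cbind(df, as.data.frame(freqs))
# adds columns apple, pear, plum with counts 2/1/0 for row 1 and 0/1/1 for row 2
```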

Error in UseMethod(“meta”, x) : no applicable method for 'try-error' applied to an object of class “character”

…衆ロ難τιáo~ · Submitted on 2019-12-06 11:35:40
Question: I am using the tm package in R to do stemming on my corpus. However, I get a problem when I run DocumentTermMatrix:

    Error in UseMethod("meta", x) : no applicable method for 'try-error' applied to an object of class "character"

Here is my workflow:

    library(tm)
    myCorpus <- Corpus(VectorSource(training$FullDescription))
    myCorpus <- tm_map(myCorpus, content_transformer(tolower), lazy=TRUE)
    myCorpus <- tm_map(myCorpus, removePunctuation, lazy=TRUE)
    myCorpus <- tm_map(myCorpus, removeNumbers, lazy=TRUE
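A sketch of the commonly suggested workaround (the text vector is a hypothetical stand-in for training$FullDescription): the same pipeline, but with the lazy = TRUE evaluation dropped, so any transformation failure surfaces at the tm_map call itself instead of later as a 'try-error' inside DocumentTermMatrix.

```r
# Sketch: eager (non-lazy) pipeline; errors are raised where they occur.
library(tm)

myCorpus <- Corpus(VectorSource(c("Some JOB description, 2nd post!")))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)

dtm <- DocumentTermMatrix(myCorpus)
inspect(dtm)
```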

R: tm Textmining package: Doc-Level metadata generation is slow

纵饮孤独 · Submitted on 2019-12-06 10:51:45
I have a list of documents to process, and for each record I want to attach some metadata to the document "member" inside the "corpus" data structure that tm, the R package, generates (from reading in text files). This for-loop works, but it is very slow; performance seems to degrade as a function f ~ 1/n_docs.

    for (i in seq(from = 1, to = length(corpus), by = 1)) {
      if (opts$options$verbose == TRUE || i %% 50 == 0) {
        print(paste(i, " ", substr(corpus[[i]], 1, 140), sep = " "))
      }
      DublinCore(corpus[[i]], "title") = csv[[i, 10]]
      DublinCore(corpus[[i]], "Publisher") = csv[[i, 16]]  # institutions
    }

This may
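One alternative worth noting (a sketch; it assumes tm >= 0.7 and requires the data frame to have doc_id and text columns): DataframeSource attaches every extra column as document-level metadata in a single pass when the corpus is built, avoiding the per-document assignment loop entirely.

```r
# Sketch: a hypothetical data frame standing in for the asker's CSV.
library(tm)

df <- data.frame(doc_id    = c("d1", "d2"),
                 text      = c("first document", "second document"),
                 title     = c("Title A", "Title B"),
                 publisher = c("Inst X", "Inst Y"),
                 stringsAsFactors = FALSE)

corpus <- VCorpus(DataframeSource(df))
meta(corpus[[1]])   # the extra columns appear as document metadata
```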

Making a wordcloud, but with combined words?

我与影子孤独终老i · Submitted on 2019-12-06 10:16:44
Question: I am trying to make a word cloud of publication keywords, for example: educational data mining; collaborative learning; computer science... etc. My current code is the following:

    KeywordsCorpus <- Corpus(VectorSource(subset(Words$Author.Keywords, Words$Year==2012)))
    KeywordsCorpus <- tm_map(KeywordsCorpus, removePunctuation)
    KeywordsCorpus <- tm_map(KeywordsCorpus, removeNumbers)
    # added tolower
    KeywordsCorpus <- tm_map(KeywordsCorpus, tolower)
    KeywordsCorpus <- tm_map(KeywordsCorpus,
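A sketch of the usual approach for multi-word keywords (the keyword strings below are made up): skip the corpus tokenization entirely, split the keyword field on its delimiter, and feed the resulting phrase counts to wordcloud directly, so "educational data mining" stays one term.

```r
# Sketch: count whole keyword phrases instead of individual words.
keywords <- c("educational data mining; collaborative learning",
              "computer science; educational data mining")

phrases <- trimws(unlist(strsplit(tolower(keywords), ";")))
freq    <- sort(table(phrases), decreasing = TRUE)
freq    # "educational data mining" counted as a single term with frequency 2

# then, with the wordcloud package installed:
# wordcloud::wordcloud(names(freq), as.numeric(freq))
```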

Does tm package itself provide a built-in way to combine document-term matrices?

烂漫一生 · Submitted on 2019-12-06 09:53:38
Question: Does the tm package itself provide a built-in way to combine document-term matrices? I generated 4 document-term matrices on the same corpus, one each for 1-, 2-, 3-, and 4-grams. They are all really big (200k × 10k), so converting them to data frames and then cbinding them is out of the question. I know I could write a program that records the non-zero elements in each of the matrices and builds a sparse matrix, but that is a lot of trouble. It just seems natural for the tm package to provide this functionality. So if it
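Not in tm itself, as far as I know, but a sketch of how this is often done with slam (tm's DTMs are slam simple_triplet_matrix objects underneath, and slam ships a sparse cbind; the tiny corpus below is a placeholder, and dtm2 is just a copy standing in for a 2-gram matrix):

```r
# Sketch: column-bind two DTMs built on the same documents, with no
# dense conversion. slam is already a dependency of tm.
library(tm)
library(slam)

corpus <- VCorpus(VectorSource(c("alpha beta", "beta gamma")))
dtm1 <- DocumentTermMatrix(corpus)   # unigrams
dtm2 <- dtm1                         # placeholder for the 2-gram matrix

combined <- cbind(dtm1, dtm2)        # slam's sparse cbind over shared rows
dim(combined)                        # 2 docs x (ncol(dtm1) + ncol(dtm2)) terms
```

Note that the result is a plain simple_triplet_matrix; if downstream code expects a DocumentTermMatrix, the class and weighting attributes have to be reattached by hand.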