tm

R tm package: create matrix of N most frequent terms

Submitted by 岁酱吖の on 2019-12-09 11:10:42
Question: I have a TermDocumentMatrix created using the tm package in R. I'm trying to create a matrix/dataframe that has the 50 most frequently occurring terms. When I try to convert to a matrix I get this error: > ap.m <- as.matrix(mydata.dtm) Error: cannot allocate vector of size 2.0 Gb So I tried converting to a sparse matrix using the Matrix package: > A <- as(mydata.dtm, "sparseMatrix") Error in as(from, "CsparseMatrix") : no method or default for coercing "TermDocumentMatrix" to "CsparseMatrix" > B
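A sketch of the standard workaround: a TermDocumentMatrix is backed by a simple_triplet_matrix, so the top-N terms can be computed with slam::row_sums without ever building the dense matrix, and a Matrix sparse object can be built straight from the triplets. The toy corpus below stands in for the large one in the question.

```r
library(tm)
library(slam)
library(Matrix)

# toy corpus standing in for the large one in the question
docs <- VCorpus(VectorSource(c("apple banana apple",
                               "banana cherry",
                               "apple cherry cherry")))
tdm <- TermDocumentMatrix(docs)

# top-N terms without densifying: row_sums works directly on the
# simple_triplet_matrix that backs a TermDocumentMatrix
freq  <- row_sums(tdm)
top50 <- head(sort(freq, decreasing = TRUE), 50)

# if a sparse matrix object is needed, build it from the triplets directly
sp <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
                   dims = c(tdm$nrow, tdm$ncol),
                   dimnames = tdm$dimnames)
```

For a 2 GB matrix this keeps memory proportional to the number of nonzero entries rather than terms × documents.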

How to scrape web content and then count frequencies of words in R?

Submitted by 吃可爱长大的小学妹 on 2019-12-09 07:33:31
Question: This is my code: library(XML) library(RCurl) url.link <- 'http://www.jamesaltucher.com/sitemap.xml' blog <- getURL(url.link) blog <- htmlParse(blog, encoding = "UTF-8") titles <- xpathSApply(blog, "//loc", xmlValue) ## titles traverse_each_page <- function(x){ tmp <- htmlParse(x) xpathApply(tmp, '//div[@id="mainContent"]') } pages <- lapply(titles[2:3], traverse_each_page) Here is the pseudocode: Take an XML document: http://www.jamesaltucher.com/sitemap.xml Go to each link Parse the html
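A minimal sketch of the counting half of the pseudocode, separated from the scraping half so it can be tested offline. The `count_words` helper is base R; `scrape_and_count` wires it to the question's own XML/RCurl pipeline (network access and that `mainContent` div are assumptions carried over from the question).

```r
library(XML)
library(RCurl)

# count word frequencies in a block of text (base R only)
count_words <- function(text) {
  words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  sort(table(words[nzchar(words)]), decreasing = TRUE)
}

# wiring it to the question's scraper (network access assumed)
scrape_and_count <- function(url) {
  page <- htmlParse(getURL(url), encoding = "UTF-8")
  text <- xpathSApply(page, '//div[@id="mainContent"]', xmlValue)
  count_words(paste(text, collapse = " "))
}
```

Applying `scrape_and_count` to each entry of `titles` then completes the "go to each link, parse, count" loop.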

How to implement proximity rules in tm dictionary for counting words?

Submitted by ≯℡__Kan透↙ on 2019-12-09 07:18:12
Question: Objective: I would like to count the number of times the word "love" appears in a document, but only if it isn't preceded by the word "not", e.g. "I love films" would count as one appearance whilst "I do not love films" would not count as an appearance. Question: How would one proceed using the tm package? R Code: Below is some self-contained code which I would like to modify to do the above. require(tm) # text vector my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely
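One way to express this proximity rule outside of tm's dictionary mechanism is a fixed-width negative lookbehind in base R regex, applied to the raw text before (or instead of) building the term-document matrix. This is a sketch, not tm's own API; the function name is made up here.

```r
# count occurrences of a word not immediately preceded by a negation word,
# using a fixed-width negative lookbehind (PCRE)
count_unnegated <- function(x, word = "love", negation = "not") {
  pattern <- sprintf("(?<!\\b%s )\\b%s\\b", negation, word)
  hits <- gregexpr(pattern, tolower(x), perl = TRUE)
  sum(vapply(hits, function(h) sum(h > 0), integer(1)))
}

count_unnegated("I love films")           # counted
count_unnegated("I do not love films")    # lookbehind rejects it
```

The trailing `\\b` also keeps "lovely" from matching, which matters for the sample text in the question.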

R and tm package: create a term-document matrix with a dictionary of one or two words?

Submitted by 谁说我不能喝 on 2019-12-09 07:01:59
Question: Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords. Web Search: Being new to text mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found: FAQs on the tm-package website; finding 2 & 3 word phrases using r tm package; counter ngram with tm package in r; findassocs for multiple terms in r. Background: Of these, I preferred the solution that uses
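A sketch of one common approach: supply a custom tokenizer that emits unigrams plus space-joined bigrams, then restrict the terms with the `dictionary` control option. The dictionary entries and sample sentences below are invented for illustration. Note the explicit `VCorpus` — the default `SimpleCorpus` ignores custom tokenizer functions.

```r
library(tm)

# tokenizer that emits unigrams plus space-joined bigrams
uni_bi_tokenizer <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  if (length(words) < 2) return(words)
  c(words, paste(head(words, -1), tail(words, -1)))
}

dict <- c("climate", "climate change")   # hypothetical dictionary with a bigram
corp <- VCorpus(VectorSource(c("climate change is real",
                               "the climate is warm")))
tdm <- TermDocumentMatrix(corp,
                          control = list(tokenize   = uni_bi_tokenizer,
                                         dictionary = dict))
inspect(tdm)
```

The resulting matrix has one row per dictionary entry, including the two-word keyword.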

How to reconnect to the PCorpus in the R tm package?

Submitted by 匆匆过客 on 2019-12-08 09:07:30
Question: I create a PCorpus, which as far as I understand is stored on the HDD, with the following code: pc = PCorpus(vs, readerControl = list(language = "pl"), dbControl = list(dbName = "pcorpus", dbType = "DB1")) How may I reconnect to that database later? Answer 1: You can't, as far as I'm aware. The 'database' is actually a filehash object, which you can reconnect to and load as follows: db <- dbInit("pcorpus") pc <- dbLoad(db) but it loads each file as its own object. You need to save to disk explicitly
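A self-contained sketch of the answer's filehash route, assuming the filehash package's `dbInit`/`dbList`/`dbFetch` API: create the persistent corpus, then reopen the underlying database file and pull each stored document back out by key.

```r
library(tm)
library(filehash)

# create a persistent corpus backed by an on-disk filehash database
vs <- VectorSource(c("first document", "second document"))
pc <- PCorpus(vs, dbControl = list(dbName = "pcorpus.db", dbType = "DB1"))

# later (e.g. in a fresh session): reopen the filehash file directly
db   <- dbInit("pcorpus.db", type = "DB1")
keys <- dbList(db)                          # keys for the stored documents
docs <- lapply(keys, function(k) dbFetch(db, k))
```

As the answer notes, this recovers the stored objects one by one rather than a ready-made PCorpus, so an in-memory corpus has to be rebuilt from `docs` if needed.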

Remove stopwords and tolower function slow on a Corpus in R

Submitted by 纵饮孤独 on 2019-12-08 06:37:36
Question: I have a corpus with roughly 75 MB of data. I am trying to use the following commands: tm_map(doc.corpus, removeWords, stopwords("english")) tm_map(doc.corpus, tolower) These two functions alone take at least 40 minutes to run. I am looking to speed up the process as I am using the TDM matrix for my model. I have tried commands like gc() and memory.limit(10000000) very frequently, but I am not able to speed up the process. I have a system with 4GB RAM and am running a local database to read the
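One commonly suggested speedup, sketched below: tm's transformations such as `removeWords` and `stripWhitespace` also have plain-character methods, and running them on a character vector before building the corpus avoids the per-document overhead of `tm_map`. (Also note that in tm 0.6+ `tolower` must be wrapped as `content_transformer(tolower)` when used inside `tm_map`.)

```r
library(tm)

txt <- c("This Is A Large Document", "And ANOTHER One Here")

# run the transformations on the raw character vector, then build the
# corpus once at the end -- typically much faster than tm_map on a corpus
txt <- tolower(txt)
txt <- removeWords(txt, stopwords("english"))
txt <- stripWhitespace(txt)

corp <- VCorpus(VectorSource(txt))
```

The same vector can be fed straight to `TermDocumentMatrix` afterwards.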

R: removeCommonTerms with Quanteda package?

Submitted by 拟墨画扇 on 2019-12-08 06:01:32
Question: The removeCommonTerms function is found here for the tm package: removeCommonTerms <- function (x, pct) { stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), is.numeric(pct), pct > 0, pct < 1) m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x t <- table(m$i) < m$ncol * (pct) termIndex <- as.numeric(names(t[t])) if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ] } Now I would like to remove too-common terms with the Quanteda package. I
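In quanteda the equivalent is `dfm_trim` with a proportional document-frequency cap. A sketch with invented sample documents; `removeCommonTerms(x, pct)` keeps terms whose document frequency is strictly below `pct`, so `max_docfreq` is set just under the threshold to match.

```r
library(quanteda)

dfmat <- dfm(tokens(c(d1 = "a common word here",
                      d2 = "a common thing there")))

# drop features occurring in >= 80% of documents (cf. removeCommonTerms(x, 0.8))
trimmed <- dfm_trim(dfmat, max_docfreq = 0.79, docfreq_type = "prop")
featnames(trimmed)
```

Here "a" and "common" appear in both documents (proportion 1.0), so they are dropped, while the document-specific words survive.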

Converting a stemmed word to the root word in R

Submitted by 邮差的信 on 2019-12-08 05:13:56
Question: Hi, I have a list of words which have been stemmed using the tm package in R. Can I get back the root word somehow after this step? Thanks in advance. Ex: activiti --> activity Answer 1: You can use the stemCompletion() function to achieve this, but you may need to trim the stems first. Consider the following: library(tm) library(qdap) # provides the stemmer() function active.text = "there are plenty of funny activities" active.corp = Corpus(VectorSource(active.text)) (st.text = tolower
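A condensed sketch of the stem-then-complete round trip, using SnowballC for stemming instead of qdap (the vocabulary below is invented). `stemCompletion` matches dictionary words that start with the stem; `type = "first"` takes the first such match, while the default `"prevalent"` picks the most frequent one.

```r
library(tm)
library(SnowballC)

dict  <- c("activity", "activities", "activate")  # original vocabulary
stems <- wordStem(dict)                           # Porter stems, e.g. "activ"

# complete each stem back to a word from the original vocabulary
stemCompletion(unique(stems), dictionary = dict, type = "first")
```

As the answer warns, a stem like "activiti" only completes to words it prefixes ("activities" but not "activity"), which is why trimming the stems, or stemming the dictionary itself, is often needed first.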

Why isn't stemDocument stemming?

Submitted by 放肆的年华 on 2019-12-08 04:52:59
Question: I am using the 'tm' package in R to create a term-document matrix using stemmed terms. The process completes, but the resulting matrix includes terms that don't appear to have been stemmed, and I'm trying to understand why that is and how to fix it. Here is the script for the process, which uses a couple of online news stories as the sandbox: library(boilerpipeR) library(RCurl) library(tm) # Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl' url <- "http://blogs
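Two frequent causes of "unstemmed" terms are missing the SnowballC package (which supplies the Porter stemmer) and stemming before lowercasing/punctuation removal, so tokens like "Running" or "fishing," pass through unchanged. A minimal sketch of the usual ordering, with an invented sample sentence:

```r
library(tm)
library(SnowballC)   # supplies the Porter stemmer that stemDocument relies on

corp <- VCorpus(VectorSource("Running and fishing, easily done."))

# stem only after lowercasing and stripping punctuation: tokens with
# attached punctuation or capitals can come through unstemmed
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stemDocument)

content(corp[[1]])
```

With this ordering, "running" and "fishing" come out as "run" and "fish" in the term-document matrix.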

Unable to get tm_map to use the mc.cores argument

Submitted by 丶灬走出姿态 on 2019-12-08 04:07:29
Question: I have a large corpus with over 10M documents. Whenever I try a transformation over multiple cores using the mc.cores argument, I get the error: Error in FUN(content(x), ...) : unused argument (mc.cores = 10) I have 15 available cores in my currently hosted RStudio. # I have a corpus > inspect(corpus[1]) <<VCorpus>> Metadata: corpus specific: 0, document level (indexed): 0 Content: documents: 1 [[1]] <<PlainTextDocument>> Metadata: 7 Content: chars: 46 > length(corpus) [1] 10255313 Watch what happens
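The error arises because in recent tm versions `tm_map` forwards extra arguments to the transformation function itself, so `mc.cores` lands in `FUN`. One workaround, sketched below, is to parallelize with `parallel::mclapply` over the raw text yourself and rebuild the corpus; note that `mc.cores > 1` relies on forking and therefore works only on Unix-like systems.

```r
library(tm)
library(parallel)

corp <- VCorpus(VectorSource(c("First DOC", "Second DOC")))

# pull the raw text out, transform it in parallel, rebuild the corpus
txt     <- vapply(content(corp), as.character, character(1))
lowered <- unlist(mclapply(txt, tolower, mc.cores = 2))
corp2   <- VCorpus(VectorSource(lowered))

as.character(corp2[[1]])
```

For 10M documents the per-document fork overhead matters, so chunking the vector (one `mclapply` task per block of documents) is usually faster than one task per document.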