tm

R tm package: create matrix of N most frequent terms

Submitted by 岁酱吖の on 2019-12-09 11:10:42
Question: I have a TermDocumentMatrix created using the tm package in R. I'm trying to create a matrix/dataframe that has the 50 most frequently occurring terms. When I try to convert to a matrix I get this error: > ap.m <- as.matrix(mydata.dtm) Error: cannot allocate vector of size 2.0 Gb So I tried converting to a sparse matrix using the Matrix package: > A <- as(mydata.dtm, "sparseMatrix") Error in as(from, "CsparseMatrix") : no method or default for coercing "TermDocumentMatrix" to "CsparseMatrix" > B
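A sketch of the standard workaround: a TermDocumentMatrix is backed by a simple_triplet_matrix, so the top-N terms can be computed with slam::row_sums without ever building the dense matrix, and a Matrix sparse object can be built straight from the triplets. The toy corpus below stands in for the large one in the question.

```r
library(tm)
library(slam)
library(Matrix)

# toy corpus standing in for the large one in the question
docs <- VCorpus(VectorSource(c("apple banana apple",
                               "banana cherry",
                               "apple cherry cherry")))
tdm <- TermDocumentMatrix(docs)

# top-N terms without densifying: row_sums works directly on the
# simple_triplet_matrix that backs a TermDocumentMatrix
freq  <- row_sums(tdm)
top50 <- head(sort(freq, decreasing = TRUE), 50)

# if a sparse matrix object is needed, build it from the triplets directly
sp <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
                   dims = c(tdm$nrow, tdm$ncol),
                   dimnames = tdm$dimnames)
```

For a 2 GB matrix this keeps memory proportional to the number of nonzero entries rather than terms × documents.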

How to scrape web content and then count frequencies of words in R?

Submitted by 吃可爱长大的小学妹 on 2019-12-09 07:33:31
Question: This is my code: library(XML) library(RCurl) url.link <- 'http://www.jamesaltucher.com/sitemap.xml' blog <- getURL(url.link) blog <- htmlParse(blog, encoding = "UTF-8") titles <- xpathSApply(blog, "//loc", xmlValue) ## titles traverse_each_page <- function(x){ tmp <- htmlParse(x) xpathApply(tmp, '//div[@id="mainContent"]') } pages <- lapply(titles[2:3], traverse_each_page) Here is the pseudocode: Take an XML document: http://www.jamesaltucher.com/sitemap.xml Go to each link Parse the html
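A minimal sketch of the counting half of the pseudocode, separated from the scraping half so it can be tested offline. The `count_words` helper is base R; `scrape_and_count` wires it to the question's own XML/RCurl pipeline (network access and that `mainContent` div are assumptions carried over from the question).

```r
library(XML)
library(RCurl)

# count word frequencies in a block of text (base R only)
count_words <- function(text) {
  words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  sort(table(words[nzchar(words)]), decreasing = TRUE)
}

# wiring it to the question's scraper (network access assumed)
scrape_and_count <- function(url) {
  page <- htmlParse(getURL(url), encoding = "UTF-8")
  text <- xpathSApply(page, '//div[@id="mainContent"]', xmlValue)
  count_words(paste(text, collapse = " "))
}
```

Applying `scrape_and_count` to each entry of `titles` then completes the "go to each link, parse, count" loop.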

How to implement proximity rules in tm dictionary for counting words?

Submitted by ≯℡__Kan透↙ on 2019-12-09 07:18:12
Question: Objective: I would like to count the number of times the word "love" appears in a document, but only if it isn't preceded by the word "not", e.g. "I love films" would count as one appearance whilst "I do not love films" would not count as an appearance. Question: How would one proceed using the tm package? R Code: Below is some self-contained code which I would like to modify to do the above. require(tm) # text vector my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely
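One way to express this proximity rule outside of tm's dictionary mechanism is a fixed-width negative lookbehind in base R regex, applied to the raw text before (or instead of) building the term-document matrix. This is a sketch, not tm's own API; the function name is made up here.

```r
# count occurrences of a word not immediately preceded by a negation word,
# using a fixed-width negative lookbehind (PCRE)
count_unnegated <- function(x, word = "love", negation = "not") {
  pattern <- sprintf("(?<!\\b%s )\\b%s\\b", negation, word)
  hits <- gregexpr(pattern, tolower(x), perl = TRUE)
  sum(vapply(hits, function(h) sum(h > 0), integer(1)))
}

count_unnegated("I love films")           # counted
count_unnegated("I do not love films")    # lookbehind rejects it
```

The trailing `\\b` also keeps "lovely" from matching, which matters for the sample text in the question.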

R and tm package: create a term-document matrix with a dictionary of one or two words?

Submitted by 谁说我不能喝 on 2019-12-09 07:01:59
Question: Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords. Web Search: Being new to text mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found: FAQs on the tm-package website; finding 2 & 3 word phrases using r tm package; counter ngram with tm package in r; findassocs for multiple terms in r. Background: Of these, I preferred the solution that uses
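A sketch of one common approach: supply a custom tokenizer that emits unigrams plus space-joined bigrams, then restrict the terms with the `dictionary` control option. The dictionary entries and sample sentences below are invented for illustration. Note the explicit `VCorpus` — the default `SimpleCorpus` ignores custom tokenizer functions.

```r
library(tm)

# tokenizer that emits unigrams plus space-joined bigrams
uni_bi_tokenizer <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  if (length(words) < 2) return(words)
  c(words, paste(head(words, -1), tail(words, -1)))
}

dict <- c("climate", "climate change")   # hypothetical dictionary with a bigram
corp <- VCorpus(VectorSource(c("climate change is real",
                               "the climate is warm")))
tdm <- TermDocumentMatrix(corp,
                          control = list(tokenize   = uni_bi_tokenizer,
                                         dictionary = dict))
inspect(tdm)
```

The resulting matrix has one row per dictionary entry, including the two-word keyword.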

How to reconnect to the PCorpus in the R tm package?

Submitted by 匆匆过客 on 2019-12-08 09:07:30
Question: I create a PCorpus, which as far as I understand is stored on the HDD, with the following code: pc = PCorpus(vs, readerControl = list(language = "pl"), dbControl = list(dbName = "pcorpus", dbType = "DB1")) How may I reconnect to that database later? Answer 1: You can't, as far as I'm aware. The 'database' is actually a filehash object, which you can reconnect to and load as follows: db <- dbInit("pcorpus") pc <- dbLoad(db) but it loads each file as its own object. You need to save to disk explicitly
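A self-contained sketch of the answer's filehash route, assuming the filehash package's `dbInit`/`dbList`/`dbFetch` API: create the persistent corpus, then reopen the underlying database file and pull each stored document back out by key.

```r
library(tm)
library(filehash)

# create a persistent corpus backed by an on-disk filehash database
vs <- VectorSource(c("first document", "second document"))
pc <- PCorpus(vs, dbControl = list(dbName = "pcorpus.db", dbType = "DB1"))

# later (e.g. in a fresh session): reopen the filehash file directly
db   <- dbInit("pcorpus.db", type = "DB1")
keys <- dbList(db)                          # keys for the stored documents
docs <- lapply(keys, function(k) dbFetch(db, k))
```

As the answer notes, this recovers the stored objects one by one rather than a ready-made PCorpus, so an in-memory corpus has to be rebuilt from `docs` if needed.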

Remove stopwords and tolower function slow on a Corpus in R

Submitted by 纵饮孤独 on 2019-12-08 06:37:36
Question: I have a corpus with roughly 75 MB of data. I am trying to use the following commands: tm_map(doc.corpus, removeWords, stopwords("english")) tm_map(doc.corpus, tolower) These two functions alone take at least 40 minutes to run. I am looking to speed up the process as I am using the TDM matrix for my model. I have tried commands like gc() and memory.limit(10000000) very frequently, but I am not able to speed up the process. I have a system with 4GB RAM and am running a local database to read the
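One commonly suggested speedup, sketched below: tm's transformations such as `removeWords` and `stripWhitespace` also have plain-character methods, and running them on a character vector before building the corpus avoids the per-document overhead of `tm_map`. (Also note that in tm 0.6+ `tolower` must be wrapped as `content_transformer(tolower)` when used inside `tm_map`.)

```r
library(tm)

txt <- c("This Is A Large Document", "And ANOTHER One Here")

# run the transformations on the raw character vector, then build the
# corpus once at the end -- typically much faster than tm_map on a corpus
txt <- tolower(txt)
txt <- removeWords(txt, stopwords("english"))
txt <- stripWhitespace(txt)

corp <- VCorpus(VectorSource(txt))
```

The same vector can be fed straight to `TermDocumentMatrix` afterwards.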

R: removeCommonTerms with Quanteda package?

Submitted by 拟墨画扇 on 2019-12-08 06:01:32
Question: The removeCommonTerms function is found here for the tm package: removeCommonTerms <- function (x, pct) { stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), is.numeric(pct), pct > 0, pct < 1) m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x t <- table(m$i) < m$ncol * (pct) termIndex <- as.numeric(names(t[t])) if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ] } Now I would like to remove too-common terms with the Quanteda package. I
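In quanteda the equivalent is `dfm_trim` with a proportional document-frequency cap. A sketch with invented sample documents; `removeCommonTerms(x, pct)` keeps terms whose document frequency is strictly below `pct`, so `max_docfreq` is set just under the threshold to match.

```r
library(quanteda)

dfmat <- dfm(tokens(c(d1 = "a common word here",
                      d2 = "a common thing there")))

# drop features occurring in >= 80% of documents (cf. removeCommonTerms(x, 0.8))
trimmed <- dfm_trim(dfmat, max_docfreq = 0.79, docfreq_type = "prop")
featnames(trimmed)
```

Here "a" and "common" appear in both documents (proportion 1.0), so they are dropped, while the document-specific words survive.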

Converting a stemmed word to the root word in R

Submitted by 邮差的信 on 2019-12-08 05:13:56
Question: Hi, I have a list of words which have been stemmed using the tm package in R. Can I get back the root word somehow after this step? Thanks in advance. Ex: activiti --> activity Answer 1: You can use the stemCompletion() function to achieve this, but you may need to trim the stems first. Consider the following: library(tm) library(qdap) # provides the stemmer() function active.text = "there are plenty of funny activities" active.corp = Corpus(VectorSource(active.text)) (st.text = tolower
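A condensed sketch of the stem-then-complete round trip, using SnowballC for stemming instead of qdap (the vocabulary below is invented). `stemCompletion` matches dictionary words that start with the stem; `type = "first"` takes the first such match, while the default `"prevalent"` picks the most frequent one.

```r
library(tm)
library(SnowballC)

dict  <- c("activity", "activities", "activate")  # original vocabulary
stems <- wordStem(dict)                           # Porter stems, e.g. "activ"

# complete each stem back to a word from the original vocabulary
stemCompletion(unique(stems), dictionary = dict, type = "first")
```

As the answer warns, a stem like "activiti" only completes to words it prefixes ("activities" but not "activity"), which is why trimming the stems, or stemming the dictionary itself, is often needed first.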

Why isn't stemDocument stemming?

Submitted by 放肆的年华 on 2019-12-08 04:52:59
Question: I am using the 'tm' package in R to create a term-document matrix using stemmed terms. The process completes, but the resulting matrix includes terms that don't appear to have been stemmed, and I'm trying to understand why that is and how to fix it. Here is the script for the process, which uses a couple of online news stories as the sandbox: library(boilerpipeR) library(RCurl) library(tm) # Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl' url <- "http://blogs
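Two frequent causes of "unstemmed" terms are missing the SnowballC package (which supplies the Porter stemmer) and stemming before lowercasing/punctuation removal, so tokens like "Running" or "fishing," pass through unchanged. A minimal sketch of the usual ordering, with an invented sample sentence:

```r
library(tm)
library(SnowballC)   # supplies the Porter stemmer that stemDocument relies on

corp <- VCorpus(VectorSource("Running and fishing, easily done."))

# stem only after lowercasing and stripping punctuation: tokens with
# attached punctuation or capitals can come through unstemmed
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stemDocument)

content(corp[[1]])
```

With this ordering, "running" and "fishing" come out as "run" and "fish" in the term-document matrix.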

Unable to get tm_map to use the mc.cores argument

Submitted by 丶灬走出姿态 on 2019-12-08 04:07:29
Question: I have a large corpus with over 10M documents. Whenever I try a transformation over multiple cores using the mc.cores argument, I get the error: Error in FUN(content(x), ...) : unused argument (mc.cores = 10) I have 15 available cores in my currently hosted RStudio. # I have a corpus > inspect(corpus[1]) <<VCorpus>> Metadata: corpus specific: 0, document level (indexed): 0 Content: documents: 1 [[1]] <<PlainTextDocument>> Metadata: 7 Content: chars: 46 > length(corpus) [1] 10255313 Watch what happens
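The error arises because in recent tm versions `tm_map` forwards extra arguments to the transformation function itself, so `mc.cores` lands in `FUN`. One workaround, sketched below, is to parallelize with `parallel::mclapply` over the raw text yourself and rebuild the corpus; note that `mc.cores > 1` relies on forking and therefore works only on Unix-like systems.

```r
library(tm)
library(parallel)

corp <- VCorpus(VectorSource(c("First DOC", "Second DOC")))

# pull the raw text out, transform it in parallel, rebuild the corpus
txt     <- vapply(content(corp), as.character, character(1))
lowered <- unlist(mclapply(txt, tolower, mc.cores = 2))
corp2   <- VCorpus(VectorSource(lowered))

as.character(corp2[[1]])
```

For 10M documents the per-document fork overhead matters, so chunking the vector (one `mclapply` task per block of documents) is usually faster than one task per document.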