tm

tm custom removePunctuation except hashtag

Submitted by 故事扮演 on 2019-12-03 17:28:33
I have a corpus of tweets from Twitter. I clean this corpus (removeWords, tolower, delete URLs) and finally also want to remove punctuation. Here is my code:

    tweetCorpus <- tm_map(tweetCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)

The problem now is that by doing so I also lose the hashtag (#). Is there a way to remove punctuation with tm_map but keep the hashtag?

You could adapt the existing removePunctuation to suit your needs. For example:

    removeMostPunctuation <- function (x, preserve_intra_word_dashes = FALSE) {
        rmpunct <- function(x) {
            x <- gsub("#", "\002", x)
            x <-
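The answer's excerpt cuts off mid-function. A plausible completion of the placeholder-swap idea (hide the hashes behind an unused control character, strip the remaining punctuation, then restore them) might look like this; the body below is a reconstruction under that assumption, not the original answer verbatim:

    removeMostPunctuation <- function(x, preserve_intra_word_dashes = FALSE) {
        rmpunct <- function(x) {
            x <- gsub("#", "\002", x)            # hide hashes behind a control char
            x <- gsub("[[:punct:]]+", "", x)     # strip all remaining punctuation
            gsub("\002", "#", x, fixed = TRUE)   # restore the hashes
        }
        if (preserve_intra_word_dashes) {
            x <- gsub("(\\w)-(\\w)", "\\1\001\\2", x)  # hide intra-word dashes too
            x <- rmpunct(x)
            gsub("\001", "-", x, fixed = TRUE)
        } else {
            rmpunct(x)
        }
    }

    # usage (newer tm versions may require wrapping in content_transformer):
    tweetCorpus <- tm_map(tweetCorpus, removeMostPunctuation,
                          preserve_intra_word_dashes = TRUE)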

FUN-error after running 'tolower' while making Twitter wordcloud

Submitted by 南笙酒味 on 2019-12-03 16:28:26
Trying to create a wordcloud from Twitter data, but I get the following error:

    Error in FUN(X[[72L]], ...) : invalid input '������������❤������������ "@xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs'

The error appears after running the tolower step:

    mytwittersearch_list   <- sapply(mytwittersearch, function(x) x$getText())
    mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
    mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)
    mytwittersearch_corpus <- tm_map(mytwittersearch_corpus,
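Two fixes usually clear this up (a sketch; it assumes the invalid input comes from emoji and other bytes that cannot be converted, which is what utf8towcs is complaining about): strip non-convertible characters before building the corpus, and wrap base functions for tm_map:

    # drop bytes that cannot be represented, before building the corpus
    mytwittersearch_list <- iconv(mytwittersearch_list, to = "ASCII", sub = "")

    # in tm >= 0.6, base functions such as tolower must be wrapped
    mytwittersearch_corpus <- tm_map(mytwittersearch_corpus,
                                     content_transformer(tolower))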

Removing stopwords from a user-defined corpus in R

Submitted by 谁说我不能喝 on 2019-12-03 16:13:45
I have a set of documents:

    documents = c("She had toast for breakfast",
                  "The coffee this morning was excellent",
                  "For lunch let's all have pancakes",
                  "Later in the day, there will be more talks",
                  "The talks on the first day were great",
                  "The second day should have good presentations too")

In this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using:

    documents = tolower(documents)              # make it lower case
    documents = gsub('[[:punct:]]', '', documents)  # remove punctuation

First I convert to a Corpus object:

    documents <- Corpus
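From there the standard tm incantation is short. A minimal sketch, assuming the stock English stopword list is acceptable:

    library(tm)

    corpus <- Corpus(VectorSource(documents))
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # back out to a character vector, one element per document
    sapply(corpus, as.character)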

R tm package: create matrix of N most frequent terms

Submitted by 蹲街弑〆低调 on 2019-12-03 13:43:43
I have a TermDocumentMatrix created using the tm package in R. I'm trying to create a matrix/data frame that has the 50 most frequently occurring terms. When I try to convert to a matrix I get this error:

    > ap.m <- as.matrix(mydata.dtm)
    Error: cannot allocate vector of size 2.0 Gb

So I tried converting to a sparse matrix using the Matrix package:

    > A <- as(mydata.dtm, "sparseMatrix")
    Error in as(from, "CsparseMatrix") :
      no method or default for coercing "TermDocumentMatrix" to "CsparseMatrix"
    > B <- Matrix(mydata.dtm, sparse = TRUE)
    Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as
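The dense conversion can be avoided entirely: a TermDocumentMatrix is a slam simple_triplet_matrix underneath, so the term totals can be taken straight off the sparse form. A sketch, assuming terms are the rows (as in a TDM):

    library(slam)

    # row_sums operates on the sparse triplet form, so no 2 Gb allocation
    freq  <- sort(slam::row_sums(mydata.dtm), decreasing = TRUE)
    top50 <- head(freq, 50)
    data.frame(term = names(top50), freq = as.integer(top50))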

Programmatically look up a ticker symbol in R

Submitted by 旧城冷巷雨未停 on 2019-12-03 09:54:41
I have a field of data containing company names, such as:

    company <- c("Microsoft", "Apple", "Cloudera", "Ford")
    > company
        Company
    1 Microsoft
    2     Apple
    3  Cloudera
    4      Ford

and so on. The package tm.plugin.webmining allows you to query data from Yahoo! Finance based on ticker symbols:

    require(tm.plugin.webmining)
    results <- WebCorpus(YahooFinanceSource("MSFT"))

I'm missing the in-between step. How can I query ticker symbols programmatically based on company names? I couldn't manage to do this with the tm.plugin.webmining package, but I came up with a rough solution - pulling & parsing data from
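One rough way to fill that gap is to hit a symbol-lookup web service directly. A sketch, with the caveat that the Yahoo Finance search endpoint and the quotes/symbol field names are assumptions; unofficial endpoints like this change over time:

    library(jsonlite)

    lookup_ticker <- function(name) {
        url <- paste0("https://query2.finance.yahoo.com/v1/finance/search?q=",
                      URLencode(name))
        res <- jsonlite::fromJSON(url)
        # take the top match, if any
        if (!is.null(res$quotes$symbol)) res$quotes$symbol[1] else NA_character_
    }

    sapply(company, lookup_ticker)   # e.g. "Microsoft" -> "MSFT"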

R and tm package: create a term-document matrix with a dictionary of one or two words?

Submitted by 本秂侑毒 on 2019-12-03 08:55:08
Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords.

Web search: Being new to text mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found:

- FAQs on the tm-package website
- finding 2 & 3 word phrases using r tm package
- counter ngram with tm package in r
- findassocs for multiple terms in r

Background: Of these, I preferred the solution that uses NGramTokenizer in the RWeka package in R, but I ran into a problem. In the example code below, I create
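For reference, the NGramTokenizer route looks roughly like this (a sketch; docs is a stand-in corpus and the dictionary entries are illustrative):

    library(tm)
    library(RWeka)

    docs <- Corpus(VectorSource(c("the second day was great",
                                  "the coffee this morning was excellent")))

    # tokenize into unigrams and bigrams
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

    tdm <- TermDocumentMatrix(docs, control = list(
        tokenize   = BigramTokenizer,
        dictionary = c("coffee", "second day")   # one- and two-word keywords mixed
    ))
    inspect(tdm)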

How to scrape web content and then count frequencies of words in R?

Submitted by 血红的双手。 on 2019-12-03 08:51:53
This is my code:

    library(XML)
    library(RCurl)
    url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
    blog <- getURL(url.link)
    blog <- htmlParse(blog, encoding = "UTF-8")
    titles <- xpathSApply(blog, "//loc", xmlValue)  ## titles

    traverse_each_page <- function(x) {
        tmp <- htmlParse(x)
        xpathApply(tmp, '//div[@id="mainContent"]')
    }
    pages <- lapply(titles[2:3], traverse_each_page)

Here is the pseudocode:

1. Take an XML document: http://www.jamesaltucher.com/sitemap.xml
2. Go to each link
3. Parse the html content of each link
4. Extract the text inside div id="mainContent"
5. Count the frequencies of each word that
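To finish steps 4-5, one way to turn the extracted nodes into word counts (a sketch; count_words and its simple regex tokenizer are ad-hoc helpers, not from the original post):

    count_words <- function(nodes) {
        txt   <- sapply(nodes, xmlValue)     # node -> plain text
        words <- tolower(unlist(strsplit(txt, "[^[:alpha:]']+")))
        words <- words[nzchar(words)]        # drop empty tokens
        sort(table(words), decreasing = TRUE)
    }

    freqs <- lapply(pages, count_words)      # one frequency table per page
    head(freqs[[1]], 20)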

install.packages(“tm”) -> “dependency 'slam' is not available”

Submitted by 北城余情 on 2019-12-02 18:22:36
I'm trying to install the tm package on IBM's Data Science Experience (DSX):

    install.packages("tm")

However, I'm hitting this issue:

    "dependency 'slam' is not available"

This post suggests that R version 3.3.1 will resolve the issue; however, the R version on DSX is:

    R version 3.3.0 (2016-05-03)

How can I resolve this issue on IBM DSX? Note that you don't have root access on DSX. I've seen similar questions on Stack Overflow, but none are asking how to fix the issue on IBM DSX, e.g. dependency
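One workaround that needs no root access is to install a slam release old enough to support R 3.3.0 from the CRAN archive into the user library, then install tm. A sketch; the pinned version number is illustrative, so check CRAN's archive for the right release:

    # devtools::install_version pulls a specific release from the CRAN archive
    install.packages("devtools")
    devtools::install_version("slam", version = "0.1-37",
                              repos = "http://cran.us.r-project.org")
    install.packages("tm")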

Trying to get tf-idf weighting working in R

Submitted by 隐身守侯 on 2019-12-02 17:33:25
I am trying to do some very basic text analysis with the tm package and get some tf-idf scores. I'm running OS X (though I've tried this on Debian Squeeze with the same result). I've got a directory (which is my working directory) with a couple of text files in it (the first containing the first three episodes of Ulysses, the second containing the second three episodes, if you must know).

R version: 2.15.1. sessionInfo() reports this about tm:

    [1] tm_0.5-8.3

Relevant bit of code:

    library('tm')
    corpus <- Corpus(DirSource('.'))
    dtm <- DocumentTermMatrix(corpus, control = list(weight = weightTfIdf))
    str
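A likely culprit is a misspelled control option: tm expects weighting, not weight, so the call above silently falls back to plain term frequencies. A corrected sketch:

    library(tm)

    corpus <- Corpus(DirSource('.'))
    # 'weighting' (not 'weight') selects the weighting function
    dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
    inspect(dtm[, 1:5])   # spot-check a few tf-idf entries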