tm

tm custom removePunctuation except hashtag

Submitted by 故事扮演 on 2019-12-03 17:28:33
I have a corpus of tweets from Twitter. I clean this corpus (removeWords, tolower, delete URLs) and finally also want to remove punctuation. Here is my code:

    tweetCorpus <- tm_map(tweetCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)

The problem now is that by doing so I also lose the hashtag (#). Is there a way to remove punctuation with tm_map but keep the hashtag?

You could adapt the existing removePunctuation to suit your needs. For example:

    removeMostPunctuation <- function (x, preserve_intra_word_dashes = FALSE) {
        rmpunct <- function(x) {
            x <- gsub("#", "\002", x)
            x <-
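The answer's excerpt cuts off mid-function. A plausible completion of the placeholder-swap idea (hide the hashes behind an unused control character, strip the remaining punctuation, then restore them) might look like this; the body below is a reconstruction under that assumption, not the original answer verbatim:

    removeMostPunctuation <- function(x, preserve_intra_word_dashes = FALSE) {
        rmpunct <- function(x) {
            x <- gsub("#", "\002", x)            # hide hashes behind a control char
            x <- gsub("[[:punct:]]+", "", x)     # strip all remaining punctuation
            gsub("\002", "#", x, fixed = TRUE)   # restore the hashes
        }
        if (preserve_intra_word_dashes) {
            x <- gsub("(\\w)-(\\w)", "\\1\001\\2", x)  # hide intra-word dashes too
            x <- rmpunct(x)
            gsub("\001", "-", x, fixed = TRUE)
        } else {
            rmpunct(x)
        }
    }

    # usage (newer tm versions may require wrapping in content_transformer):
    tweetCorpus <- tm_map(tweetCorpus, removeMostPunctuation,
                          preserve_intra_word_dashes = TRUE)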

FUN-error after running 'tolower' while making Twitter wordcloud

Submitted by 南笙酒味 on 2019-12-03 16:28:26
Trying to create a wordcloud from Twitter data, but I get the following error:

    Error in FUN(X[[72L]], ...) : invalid input '������������❤������������ "@xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs'

The error appears after running the tolower step:

    mytwittersearch_list   <- sapply(mytwittersearch, function(x) x$getText())
    mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
    mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)
    mytwittersearch_corpus <- tm_map(mytwittersearch_corpus,
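Two fixes usually clear this up (a sketch; it assumes the invalid input comes from emoji and other bytes that cannot be converted, which is what utf8towcs is complaining about): strip non-convertible characters before building the corpus, and wrap base functions for tm_map:

    # drop bytes that cannot be represented, before building the corpus
    mytwittersearch_list <- iconv(mytwittersearch_list, to = "ASCII", sub = "")

    # in tm >= 0.6, base functions such as tolower must be wrapped
    mytwittersearch_corpus <- tm_map(mytwittersearch_corpus,
                                     content_transformer(tolower))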

Removing stopwords from a user-defined corpus in R

Submitted by 谁说我不能喝 on 2019-12-03 16:13:45
I have a set of documents:

    documents = c("She had toast for breakfast",
                  "The coffee this morning was excellent",
                  "For lunch let's all have pancakes",
                  "Later in the day, there will be more talks",
                  "The talks on the first day were great",
                  "The second day should have good presentations too")

In this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using:

    documents = tolower(documents)              # make it lower case
    documents = gsub('[[:punct:]]', '', documents)  # remove punctuation

First I convert to a Corpus object:

    documents <- Corpus
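From there the standard tm incantation is short. A minimal sketch, assuming the stock English stopword list is acceptable:

    library(tm)

    corpus <- Corpus(VectorSource(documents))
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # back out to a character vector, one element per document
    sapply(corpus, as.character)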

R tm package: create matrix of N most frequent terms

Submitted by 蹲街弑〆低调 on 2019-12-03 13:43:43
I have a TermDocumentMatrix created using the tm package in R. I'm trying to create a matrix/data frame that has the 50 most frequently occurring terms. When I try to convert to a matrix I get this error:

    > ap.m <- as.matrix(mydata.dtm)
    Error: cannot allocate vector of size 2.0 Gb

So I tried converting to a sparse matrix using the Matrix package:

    > A <- as(mydata.dtm, "sparseMatrix")
    Error in as(from, "CsparseMatrix") :
      no method or default for coercing "TermDocumentMatrix" to "CsparseMatrix"
    > B <- Matrix(mydata.dtm, sparse = TRUE)
    Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as
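The dense conversion can be avoided entirely: a TermDocumentMatrix is a slam simple_triplet_matrix underneath, so the term totals can be taken straight off the sparse form. A sketch, assuming terms are the rows (as in a TDM):

    library(slam)

    # row_sums operates on the sparse triplet form, so no 2 Gb allocation
    freq  <- sort(slam::row_sums(mydata.dtm), decreasing = TRUE)
    top50 <- head(freq, 50)
    data.frame(term = names(top50), freq = as.integer(top50))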

Programmatically look up a ticker symbol in R

Submitted by 旧城冷巷雨未停 on 2019-12-03 09:54:41
I have a field of data containing company names, such as:

    company <- c("Microsoft", "Apple", "Cloudera", "Ford")
    > company
        Company
    1 Microsoft
    2     Apple
    3  Cloudera
    4      Ford

and so on. The package tm.plugin.webmining allows you to query data from Yahoo! Finance based on ticker symbols:

    require(tm.plugin.webmining)
    results <- WebCorpus(YahooFinanceSource("MSFT"))

I'm missing the in-between step. How can I query ticker symbols programmatically based on company names? I couldn't manage to do this with the tm.plugin.webmining package, but I came up with a rough solution - pulling & parsing data from
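One rough way to fill that gap is to hit a symbol-lookup web service directly. A sketch, with the caveat that the Yahoo Finance search endpoint and the quotes/symbol field names are assumptions; unofficial endpoints like this change over time:

    library(jsonlite)

    lookup_ticker <- function(name) {
        url <- paste0("https://query2.finance.yahoo.com/v1/finance/search?q=",
                      URLencode(name))
        res <- jsonlite::fromJSON(url)
        # take the top match, if any
        if (!is.null(res$quotes$symbol)) res$quotes$symbol[1] else NA_character_
    }

    sapply(company, lookup_ticker)   # e.g. "Microsoft" -> "MSFT"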

R and tm package: create a term-document matrix with a dictionary of one or two words?

Submitted by 本秂侑毒 on 2019-12-03 08:55:08
Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords.

Web search: Being new to text mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found:

- FAQs on the tm-package website
- finding 2 & 3 word phrases using r tm package
- counter ngram with tm package in r
- findassocs for multiple terms in r

Background: Of these, I preferred the solution that uses NGramTokenizer in the RWeka package in R, but I ran into a problem. In the example code below, I create
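For reference, the NGramTokenizer route looks roughly like this (a sketch; docs is a stand-in corpus and the dictionary entries are illustrative):

    library(tm)
    library(RWeka)

    docs <- Corpus(VectorSource(c("the second day was great",
                                  "the coffee this morning was excellent")))

    # tokenize into unigrams and bigrams
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

    tdm <- TermDocumentMatrix(docs, control = list(
        tokenize   = BigramTokenizer,
        dictionary = c("coffee", "second day")   # one- and two-word keywords mixed
    ))
    inspect(tdm)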

How to scrape web content and then count frequencies of words in R?

Submitted by 血红的双手。 on 2019-12-03 08:51:53
This is my code:

    library(XML)
    library(RCurl)
    url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
    blog <- getURL(url.link)
    blog <- htmlParse(blog, encoding = "UTF-8")
    titles <- xpathSApply(blog, "//loc", xmlValue)  ## titles

    traverse_each_page <- function(x) {
        tmp <- htmlParse(x)
        xpathApply(tmp, '//div[@id="mainContent"]')
    }
    pages <- lapply(titles[2:3], traverse_each_page)

Here is the pseudocode:

1. Take an XML document: http://www.jamesaltucher.com/sitemap.xml
2. Go to each link
3. Parse the html content of each link
4. Extract the text inside div id="mainContent"
5. Count the frequencies of each word that
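To finish steps 4-5, one way to turn the extracted nodes into word counts (a sketch; count_words and its simple regex tokenizer are ad-hoc helpers, not from the original post):

    count_words <- function(nodes) {
        txt   <- sapply(nodes, xmlValue)     # node -> plain text
        words <- tolower(unlist(strsplit(txt, "[^[:alpha:]']+")))
        words <- words[nzchar(words)]        # drop empty tokens
        sort(table(words), decreasing = TRUE)
    }

    freqs <- lapply(pages, count_words)      # one frequency table per page
    head(freqs[[1]], 20)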

install.packages(“tm”) -> “dependency 'slam' is not available”

Submitted by 北城余情 on 2019-12-02 18:22:36
I'm trying to install the tm package on IBM's Data Science Experience (DSX):

    install.packages("tm")

However, I'm hitting this issue:

    "dependency 'slam' is not available"

This post suggests that R version 3.3.1 will resolve the issue; however, the R version on DSX is:

    R version 3.3.0 (2016-05-03)

How can I resolve this issue on IBM DSX? Note that you don't have root access on DSX. I've seen similar questions on Stack Overflow, but none are asking how to fix the issue on IBM DSX, e.g. dependency
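One workaround that needs no root access is to install a slam release old enough to support R 3.3.0 from the CRAN archive into the user library, then install tm. A sketch; the pinned version number is illustrative, so check CRAN's archive for the right release:

    # devtools::install_version pulls a specific release from the CRAN archive
    install.packages("devtools")
    devtools::install_version("slam", version = "0.1-37",
                              repos = "http://cran.us.r-project.org")
    install.packages("tm")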

Trying to get tf-idf weighting working in R

Submitted by 隐身守侯 on 2019-12-02 17:33:25
I am trying to do some very basic text analysis with the tm package and get some tf-idf scores. I'm running OS X (though I've tried this on Debian Squeeze with the same result). I've got a directory (which is my working directory) with a couple of text files in it (the first containing the first three episodes of Ulysses, the second containing the second three episodes, if you must know).

R version: 2.15.1. sessionInfo() reports this about tm:

    [1] tm_0.5-8.3

Relevant bit of code:

    library('tm')
    corpus <- Corpus(DirSource('.'))
    dtm <- DocumentTermMatrix(corpus, control = list(weight = weightTfIdf))
    str
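A likely culprit is a misspelled control option: tm expects weighting, not weight, so the call above silently falls back to plain term frequencies. A corrected sketch:

    library(tm)

    corpus <- Corpus(DirSource('.'))
    # 'weighting' (not 'weight') selects the weighting function
    dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
    inspect(dtm[, 1:5])   # spot-check a few tf-idf entries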