tm

tm.plugin.sentiment issue. Error: could not find function "DMetaData"

我是研究僧i submitted on 2019-12-08 03:41:53
Question: I have tried countless times in different ways to run the score() function in the tm.plugin.sentiment package in R, but I keep getting the same error. This is a sample code:

    library(tm.plugin.webmining)
    library(tm.plugin.sentiment)
    cor <- WebCorpus(GoogleFinanceSource("NASDAQ:MSFT"))
    tm_tag_score <- tm_term_score
    corpus <- score(cor)

This is the error I get:

    Error in score(cor) : could not find function "DMetaData"

Answer 1: Looks like it's caused by the removal of the DMetaData function from the …
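A commonly suggested workaround (an assumption on my part: the error comes from newer tm releases dropping DMetaData, which tm.plugin.sentiment still calls) is to pin an older tm from the CRAN archive:

```r
# Sketch: install an archived tm release that still exports DMetaData.
# The exact version to pin (0.5-10 here) is an assumption; check the
# CRAN archive for the last release before the function was removed.
# install.packages("devtools")
devtools::install_version("tm", version = "0.5-10",
                          repos = "http://cran.r-project.org")
library(tm)
```

Restart the R session after downgrading so the old namespace is actually loaded.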

Turn Unicode into Umlaut in R on Mac (Facebook Data)

懵懂的女人 submitted on 2019-12-08 03:08:12
Question: I did a lot of research on this and I still can't find a solution. I have extracted data from a German Facebook group that looks like this:

    from_ID   from_name   message                                      created_time
    12334543  Max Muster  Dies war auch eine sehr sch<U+00F6>ne Bucht  2016-01-08T19:00:54+0000

I understand that <U+00F6> stands for the German Umlaut ö. There are many other examples of such Unicode escapes replacing German Umlaute or other language-specific characters (no matter which language). No matter whether I want to do a sentiment …
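The literal `<U+XXXX>` escapes can be mapped back to real characters in base R; a minimal sketch (the function name fix_unicode is mine):

```r
# Replace literal "<U+XXXX>" escapes with the characters they encode.
fix_unicode <- function(x) {
  m <- gregexpr("<U\\+([0-9A-Fa-f]{4})>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    vapply(esc,
           function(e) intToUtf8(strtoi(sub("<U\\+([0-9A-Fa-f]{4})>", "\\1", e), 16L)),
           character(1))
  })
  x
}

fix_unicode("Dies war auch eine sehr sch<U+00F6>ne Bucht")
# "Dies war auch eine sehr schöne Bucht"
```

This runs vectorised over a whole column, so it can be applied to the message field before any sentiment step.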

R How do i keep punctuation with TermDocumentMatrix()

落花浮王杯 submitted on 2019-12-08 02:58:36
Question: I have a large dataframe where I am identifying patterns in strings and then extracting them. I have provided a small subset to illustrate my task. I generate my patterns by creating a TermDocumentMatrix with multiple words. I use these patterns with stri_extract and str_replace from the stringi and stringr packages to search within the punct_prob dataframe. My problem is that I need to keep punctuation intact within punct_prob$description to maintain the literal meanings within …
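One way to keep punctuation (a sketch, not necessarily the accepted answer): give TermDocumentMatrix a plain whitespace tokenizer, so tm never splits punctuation off the tokens.

```r
library(tm)

corp <- VCorpus(VectorSource(c("Keep it; really.", "Don't remove, please!")))

# A whitespace-only tokenizer leaves punctuation attached to words;
# wordLengths = c(1, Inf) keeps short tokens tm would otherwise drop.
tdm <- TermDocumentMatrix(corp, control = list(
  tokenize    = function(x) unlist(strsplit(as.character(x), "\\s+")),
  wordLengths = c(1, Inf)))

Terms(tdm)  # terms such as "it;" and "really." survive with punctuation
```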

DocumentTermMatrix wrong counting when using a dictionary

大兔子大兔子 submitted on 2019-12-07 18:26:36
Question: I am trying to do a sentiment analysis on Twitter data using the naive Bayes algorithm, looking at 2000 tweets. After getting the data into RStudio I split and preprocess the data as follows:

    train_size = floor(0.75 * nrow(Tweets_Model_Input))
    set.seed(123)
    train_sub = sample(seq_len(nrow(Tweets_Model_Input)), size = train_size)
    Tweets_Model_Input_Train = Tweets_Model_Input[train_sub, ]
    Tweets_Model_Input_Test = Tweets_Model_Input[-train_sub, ]
    myCorpus = Corpus …
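"Wrong counting" with a dictionary often comes down to building the test matrix from its own vocabulary; a sketch (the toy corpus objects are mine) that reuses the training terms for the test set:

```r
library(tm)

train_corp <- VCorpus(VectorSource(c("good film great film", "bad plot")))
test_corp  <- VCorpus(VectorSource("great plot bad acting"))

# Build the training DTM, then force the test DTM onto the SAME
# vocabulary so naive Bayes sees consistent columns for both sets.
train_dtm <- DocumentTermMatrix(train_corp)
test_dtm  <- DocumentTermMatrix(test_corp,
                                control = list(dictionary = Terms(train_dtm)))

all(Terms(test_dtm) %in% Terms(train_dtm))  # TRUE
```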

Remove ngrams with leading and trailing stopwords

↘锁芯ラ submitted on 2019-12-07 16:29:32
Question: I want to identify major n-grams in a bunch of academic papers, including n-grams with nested stopwords, but not n-grams with leading or trailing stopwords. I have about 100 PDF files. I converted them to plain-text files through an Adobe batch command and collected them in a single directory. From there I use R. (It's a patchwork of code because I'm just getting started with text mining.) My code:

    library(tm)
    # Make path for sub-dir which contains corpus files
    path <- file.path(getwd(), …
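Filtering the extracted n-grams afterwards is one straightforward option; a base-R sketch (the function name is mine) that drops n-grams whose first or last token is a stopword while keeping nested ones:

```r
# A tiny illustrative stopword list; in practice use tm::stopwords("en").
stops <- c("of", "the", "a", "in", "and")

drop_edge_stopwords <- function(ngrams, stops) {
  toks <- strsplit(ngrams, "\\s+")
  keep <- vapply(toks, function(t) {
    !(tolower(t[1]) %in% stops || tolower(t[length(t)]) %in% stops)
  }, logical(1))
  ngrams[keep]
}

drop_edge_stopwords(
  c("of the model", "hidden markov model", "state of the art"), stops)
# "hidden markov model" "state of the art"
```

"state of the art" survives because its stopwords are nested, not at either edge.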

R: removeCommonTerms with Quanteda package?

蓝咒 submitted on 2019-12-07 15:09:38
The removeCommonTerms function for the tm package is found here:

    removeCommonTerms <- function(x, pct) {
      stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
                is.numeric(pct), pct > 0, pct < 1)
      m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x
      t <- table(m$i) < m$ncol * (pct)
      termIndex <- as.numeric(names(t[t]))
      if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ]
    }

Now I would like to remove overly common terms with the quanteda package. I could do this removal before creating the document-feature matrix, or on the document-feature matrix itself.
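quanteda's own trimming covers this directly; a sketch (assuming quanteda >= 1.3, where docfreq_type = "prop" expresses the threshold as a proportion of documents, mirroring pct above):

```r
library(quanteda)

dfmat <- dfm(tokens(c("a b c", "a b d", "a e f")))

# Remove features occurring in more than 50% of documents --
# roughly the dfm-side equivalent of removeCommonTerms(x, 0.5).
trimmed <- dfm_trim(dfmat, max_docfreq = 0.5, docfreq_type = "prop")
featnames(trimmed)
# "c" "d" "e" "f"  ("a" and "b" are too common)
```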

Print first line of one element of Corpus in R using tm package

半世苍凉 submitted on 2019-12-07 09:56:28
How do you print a small sample, or first line, of a corpus in R using the tm package? I have a very large corpus (> 1 GB) and am doing some text cleaning. I would like to test as I apply cleaning procedures. Printing just the first line, or first few lines, of a corpus would be ideal.

    # Load libraries
    library(tm)
    # Read in corpus
    corp <- SimpleCorpus(DirSource("C:/TextDocument"))
    # Remove punctuation
    corp <- removePunctuation(corp,
                              preserve_intra_word_contractions = TRUE,
                              preserve_intra_word_dashes = TRUE)

I have tried accessing the corpus several ways:

    # Print first line of first element of …
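For a quick peek, pull the first document's text with as.character() and split on newlines; a sketch with a toy in-memory corpus (the DirSource corpus from the question would work the same way):

```r
library(tm)

corp <- VCorpus(VectorSource(c("First line\nSecond line", "Other document")))

# as.character() extracts the text of a single document; split on
# newlines and keep only the first line of document 1.
first_line <- strsplit(as.character(corp[[1]]), "\n", fixed = TRUE)[[1]][1]
first_line
# "First line"
```

Wrapping the same idiom in head(..., n) shows the first few lines instead of just one.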

R tm package: utf-8 text

荒凉一梦 submitted on 2019-12-07 05:12:13
Question: I would like to create a wordcloud for non-English text in UTF-8 (actually, it's in the Kazakh language). The text is displayed absolutely correctly by the inspect function of the tm package. However, when I search for word frequencies everything is displayed incorrectly: the text is shown with coded characters instead of words. Cyrillic characters are displayed correctly. Consequently, the wordcloud becomes a complete mess. Is it possible to assign an encoding to the tm function somehow …
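A commonly suggested repair (an assumption: the encoding declaration was lost on input rather than the data being corrupt) is to convert all content to UTF-8 before counting frequencies:

```r
library(tm)

corp <- VCorpus(VectorSource("Мысал мәтін"))  # toy Cyrillic sample

# Re-declare/convert every document to UTF-8; sub = "byte" keeps
# unconvertible bytes visible instead of silently dropping them.
corp <- tm_map(corp, content_transformer(function(x)
  iconv(x, from = "", to = "UTF-8", sub = "byte")))
```

When the files on disk really are UTF-8, reading them with DirSource(..., encoding = "UTF-8") up front avoids the problem entirely.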

Snowball Stemmer only stems last word

落花浮王杯 submitted on 2019-12-07 01:40:59
Question: I want to stem the documents in a corpus of plain-text documents using the tm package in R. When I apply the SnowballStemmer function to all documents of the corpus, only the last word of each document is stemmed.

    library(tm)
    library(Snowball)
    library(RWeka)
    library(rJava)
    path <- c("C:/path/to/diretory")
    corp <- Corpus(DirSource(path),
                   readerControl = list(reader = readPlain,
                                        language = "en_US",
                                        load = TRUE))
    tm_map(corp, SnowballStemmer)  # stemDocument has the same problem

I think it is …
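The usual explanation is that the stemmer is handed the whole document as a single string, so only the trailing token gets stemmed; stemming token by token fixes it. A sketch using SnowballC::wordStem (the Snowball package's successor; assuming SnowballC is available):

```r
library(tm)
library(SnowballC)

corp <- VCorpus(VectorSource("running dogs and jumping cats"))

# Split each document into words, stem every word, then rejoin,
# so the stemmer never sees the document as one long token.
stem_words <- content_transformer(function(x)
  paste(wordStem(unlist(strsplit(x, "\\s+")), language = "english"),
        collapse = " "))

corp <- tm_map(corp, stem_words)
as.character(corp[[1]])
# "run dog and jump cat"
```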

converting stemmed word to the root word in R

坚强是说给别人听的谎言 submitted on 2019-12-06 20:41:30
Hi, I have a list of words which have been stemmed using the tm package in R. Can I get back the root word somehow after this step? Thanks in advance. Example: activiti --> activity. You can use the stemCompletion() function to achieve this, but you may need to trim the stems first. Consider the following:

    library(tm)
    library(qdap)  # provides the stemmer() function
    active.text = "there are plenty of funny activities"
    active.corp = Corpus(VectorSource(active.text))
    (st.text = tolower(stemmer(active.text, warn = F)))
    # this is what the columns of your Term Document Matrix are going to look like
    [1] …
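Building on that, stemCompletion() maps stems back to full words drawn from a dictionary of the original, unstemmed terms; a minimal sketch (the hand-made dictionary stands in for the terms of your unstemmed corpus):

```r
library(tm)

# Dictionary of original words and the stems to complete.
dict  <- c("activity", "activities", "active", "funny")
stems <- c("activiti", "activ")

# type = "prevalent" picks the most frequent matching completion
# when several dictionary words share the stem as a prefix.
stemCompletion(stems, dictionary = dict, type = "prevalent")
```

Note the completion is prefix-based: "activiti" can only complete to "activities", since "activity" does not start with that stem, which is why trimming the stems first can matter.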