text-mining

Converting a stemmed word to the root word in R

Submitted by 邮差的信 on 2019-12-08 05:13:56
Question: Hi, I have a list of words that have been stemmed using the "tm" package in R. Can I somehow get back the root word after this step? Thanks in advance. Example: activiti --> activity

Answer 1: You can use the stemCompletion() function to achieve this, but you may need to trim the stems first. Consider the following:

    library(tm)
    library(qdap)  # provides the stemmer() function
    active.text = "there are plenty of funny activities"
    active.corp = Corpus(VectorSource(active.text))
    (st.text = tolower
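For reference, a minimal self-contained sketch of stemCompletion() with a made-up dictionary; in practice the dictionary would be built from the original, unstemmed corpus:

    library(tm)

    # Hypothetical dictionary of original (unstemmed) words
    dict <- c("there", "are", "plenty", "of", "funny", "activities")

    # Completes a stem to the most frequent dictionary word it prefixes
    stemCompletion("activiti", dictionary = dict, type = "prevalent")
    # activiti -> "activities"

    # A stem that is not a prefix of any dictionary word does not complete,
    # which is why the answer suggests trimming stems first:
    stemCompletion("funni", dictionary = dict)  # no dictionary word starts with "funni"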

Why isn't stemDocument stemming?

Submitted by 放肆的年华 on 2019-12-08 04:52:59
Question: I am using the 'tm' package in R to create a term-document matrix of stemmed terms. The process completes, but the resulting matrix includes terms that don't appear to have been stemmed, and I'm trying to understand why that is and how to fix it. Here is the script for the process, which uses a couple of online news stories as the sandbox:

    library(boilerpipeR)
    library(RCurl)
    library(tm)
    # Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
    url <- "http://blogs
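As a point of comparison, here is a minimal stemming pipeline (toy sentences, not the asker's data) that does produce stemmed terms in the matrix. One thing worth checking: tm's stemDocument() relies on the SnowballC package, so a missing SnowballC install is a common reason stemming silently misbehaves.

    library(tm)
    # library(SnowballC)  # stemDocument() uses SnowballC under the hood

    txt  <- c("running runners ran", "stemming stems stemmed")
    corp <- Corpus(VectorSource(txt))
    corp <- tm_map(corp, content_transformer(tolower))
    corp <- tm_map(corp, stemDocument, language = "english")

    tdm <- TermDocumentMatrix(corp)
    Terms(tdm)  # expect stems such as "run", "runner", "stem"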

tm.plugin.sentiment issue. Error: could not find function “DMetaData”

Submitted by 我是研究僧i on 2019-12-08 03:41:53
Question: I have tried countless times, in different ways, to run the score() function from the tm.plugin.sentiment package in R, but I keep getting the same error. This is a sample of the code:

    library(tm.plugin.webmining)
    library(tm.plugin.sentiment)
    cor <- WebCorpus(GoogleFinanceSource("NASDAQ:MSFT"))
    tm_tag_score <- tm_term_score
    corpus <- score(cor)

This is the error I get:

    Error in score(cor) : could not find function "DMetaData"

Answer 1: Looks like it's caused by the removal of the DMetaData function from the tm package.
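DMetaData() belonged to tm's old metadata API, so one hedged workaround (assuming the archived release still installs on your setup) is to pin tm to a pre-0.6 version from the CRAN archive:

    # Sketch: downgrade tm to a release that still exports DMetaData()
    install.packages("devtools")
    devtools::install_version("tm", version = "0.5-10",
                              repos = "http://cran.r-project.org")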

Is there an algorithm for determining the relevance of a text to a theme?

Submitted by 浪尽此生 on 2019-12-08 03:11:50
Question: I want to know what can be used to determine the relevance of a page to a theme such as games, movies, etc. Is there research in this area, or does it come down to counting how many times relevant words appear?

Answer 1: The common choice is supervised document classification on bag-of-words (or bag-of-n-grams) features, preferably with tf-idf weighting. Popular algorithms include Naive Bayes and (linear) SVMs. For this approach, you'll need labeled training data, i.e. documents annotated with
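To make the recommended representation concrete, a small sketch in R with toy documents; tm's weightTfIdf supplies the tf-idf weighting:

    library(tm)

    docs <- c("new game console release date",
              "the movie premiere is this friday")
    corp <- Corpus(VectorSource(docs))

    # Bag-of-words features with tf-idf weighting
    dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
    inspect(dtm)

    # Each row is a feature vector; with theme labels attached, these rows
    # can train a Naive Bayes or linear SVM classifier.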

Error using “TermDocumentMatrix” and “Dist” functions in R

Submitted by 会有一股神秘感。 on 2019-12-08 02:11:38
Question: I have been trying to replicate the example here, but I have had some problems along the way. Everything worked fine until this point:

    docsTDM <- TermDocumentMatrix(docs8)

    Error in UseMethod("meta", x) :
      no applicable method for 'meta' applied to an object of class "character"
    In addition: Warning message:
    In mclapply(unname(content(x)), termFreq, control) :
      all scheduled cores encountered errors in user code

So I was able to fix that error by modifying the previous step, changing this:

    docs8 <- tm
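For context, the usual cause of this exact error in tm >= 0.6: tm_map() applied with a plain base function returns bare character vectors, which TermDocumentMatrix() then cannot handle. A minimal sketch of the failing and fixed variants:

    library(tm)

    corp <- Corpus(VectorSource(c("Some TEXT here", "More Text")))

    # corp <- tm_map(corp, tolower)  # yields characters -> the "meta" error
    corp <- tm_map(corp, content_transformer(tolower))  # keeps documents intact

    docsTDM <- TermDocumentMatrix(corp)  # now works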

How is polarity calculated for a sentence? (in sentiment analysis)

Submitted by 。_饼干妹妹 on 2019-12-08 02:05:50
Question: How is the polarity of the words in a statement calculated? For example, in "I am successful in accomplishing the task, but in vain", how is each word scored (e.g. successful: 0.7, accomplishing: 0.8, but: -0.5, vain: -0.8)? How is each word given a value or score? What is going on behind the scenes? As I am doing sentiment analysis, there are a few things I'd like to clear up, so it would be great if someone could help. Thanks in advance.

Answer 1: If you are willing to use Python and NLTK, then check
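The question itself describes the simplest scheme, a lexicon lookup where the sentence score is the sum of per-word polarities. A toy R sketch using the made-up scores from the question (real systems use curated lexicons or trained models):

    # Hypothetical lexicon for illustration only
    lexicon <- c(successful = 0.7, accomplishing = 0.8,
                 but = -0.5, vain = -0.8)

    sentence <- "i am successful in accomplishing the task but in vain"
    words    <- strsplit(tolower(sentence), "[^a-z]+")[[1]]

    scores <- lexicon[words]   # NA for words not in the lexicon
    sum(scores, na.rm = TRUE)  # sentence polarity: 0.2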

SVM for Text Mining using scikit

Submitted by 狂风中的少年 on 2019-12-07 23:31:22
Question: Can someone share a code snippet that shows how to use an SVM for text mining with scikit? I have seen an example of SVM on numerical data, but I'm not quite sure how to deal with text. I looked at http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html but couldn't find SVM.

Answer 1: In text mining problems, text is represented by numeric values. Each feature represents a word, and the values are binary numbers. That gives a matrix with lots of zeros and a few 1s, which means
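The question asks for scikit-learn specifically; since the rest of this page works in R, here is a rough equivalent of the same pipeline (bag-of-words with tf-idf into a linear SVM) sketched with the e1071 package and made-up toy texts and labels:

    library(tm)
    library(e1071)  # provides svm()

    train_text  <- c("stocks rallied after strong earnings",
                     "the movie was a thrilling film",
                     "shares fell on the weak report",
                     "great acting in a fine film")
    train_label <- factor(c("finance", "movies", "finance", "movies"))

    # Text becomes a mostly-zero numeric matrix, one column per word
    dtm <- DocumentTermMatrix(Corpus(VectorSource(train_text)),
                              control = list(weighting = weightTfIdf))

    fit <- svm(x = as.matrix(dtm), y = train_label, kernel = "linear")
    predict(fit, as.matrix(dtm))  # sanity check on the training matrix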

Stock Tweets, Text Mining, Emoticon Errors

Submitted by 偶尔善良 on 2019-12-07 20:21:40
Question: I was hoping you'd be able to assist with a text mining exercise. I was interested in 'AAPL' tweets and was able to pull 500 tweets from the API. I cleared several hurdles on my own, but need help with the last part: for some reason, the tm package is not removing stopwords. Can you please take a look and see what the problem might be? Are emoticons causing an issue? After plotting term frequency, the most frequent terms are "AAPL", "Apple", "iPhone", "Price", "Stock". Thanks in advance!
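Without the full script it's hard to be certain, but two frequent culprits with tweets are case (stopwords("english") is all lowercase, so removal must happen after lowercasing) and non-ASCII emoticons breaking tm's transformations. A hedged sketch of a cleaning order that avoids both, on toy tweets:

    library(tm)

    tweets <- c("AAPL is up! The iPhone price looks good :)",
                "I think the Apple stock will rise")
    tweets <- iconv(tweets, to = "ASCII", sub = "")  # strip emoji/non-ASCII

    corp <- Corpus(VectorSource(tweets))
    corp <- tm_map(corp, content_transformer(tolower))  # before removeWords
    corp <- tm_map(corp, removePunctuation)
    corp <- tm_map(corp, removeWords, stopwords("english"))

    Terms(TermDocumentMatrix(corp))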

Remove ngrams with leading and trailing stopwords

Submitted by ↘锁芯ラ on 2019-12-07 16:29:32
Question: I want to identify major n-grams in a bunch of academic papers, including n-grams with nested stopwords, but not n-grams with leading or trailing stopwords. I have about 100 PDF files. I converted them to plain-text files through an Adobe batch command and collected them in a single directory. From there I use R. (It's a patchwork of code because I'm just getting started with text mining.) My code:

    library(tm)
    # Make path for sub-dir which contains corpus files
    path <- file.path(getwd(),
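One base-R way to get the filtering described above, independent of how the corpus is loaded: generate the n-grams, then drop any whose first or last token is a stopword (stopwords in the middle survive). A toy sketch for trigrams:

    library(tm)  # for stopwords()

    txt    <- "the analysis of variance is a model of the data"
    tokens <- strsplit(tolower(txt), "\\s+")[[1]]

    # All trigrams
    grams <- vapply(seq_len(length(tokens) - 2),
                    function(i) paste(tokens[i:(i + 2)], collapse = " "),
                    character(1))

    parts <- strsplit(grams, " ")
    first <- vapply(parts, function(p) p[1], character(1))
    last  <- vapply(parts, function(p) p[length(p)], character(1))

    sw <- stopwords("english")
    grams[!(first %in% sw) & !(last %in% sw)]
    # keeps "analysis of variance"; drops e.g. "the analysis of"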

Is it possible to append words to an existing OpenNLP POS corpus/model?

Submitted by 拟墨画扇 on 2019-12-07 15:48:19
Question: Is there a way to further train the existing Apache OpenNLP POS tagger model? I need to add a few more proper nouns to the model that are specific to my application. When I try the command below:

    opennlp POSTaggerTrainer -type maxent -model en-pos-maxent.bin \
        -lang en -data en-pos.train -encoding UTF-8

the entire model is retrained. I'd only like to append a few new sentences to en-pos-maxent.bin. This is how my training file looks:

    Where_WRB is_VBZ the_DT Seven_DNNP Dwarfs_DNNP Mine_DNNP
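A note on how OpenNLP's maxent trainer behaves (not part of the original thread): POSTaggerTrainer always builds a fresh model from the -data file, so there is no way to incrementally append sentences to an existing .bin. The usual route is to concatenate the new sentences with a full base training corpus and retrain; a trivial R sketch of preparing the combined file, with hypothetical file names (the stock en-pos-maxent.bin's original training corpus is not redistributed, so you need your own base data):

    # Hypothetical file names for illustration
    base  <- readLines("en-pos-base.train")     # your full base corpus
    extra <- readLines("en-pos-new.train")      # the new tagged sentences
    writeLines(c(base, extra), "en-pos.train")  # retrain on the union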