text-mining

Error using “TermDocumentMatrix” and “Dist” functions in R

Posted by 柔情痞子 on 2019-12-06 13:39:30
I have been trying to replicate the example here, but I have had some problems along the way. Everything worked fine until this step:

docsTDM <- TermDocumentMatrix(docs8)

which produced:

Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code

I was able to fix that error by modifying the previous step, changing this:

docs8 <- tm_map(docs7, tolower)

to this:

docs8 <- tm_map(docs7, content_transformer(tolower))

But then I got in…
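The fix above works because tm_map expects the mapped function to return a document object, while plain tolower returns a bare character vector, which is why TermDocumentMatrix later fails with the 'meta' error. content_transformer wraps such functions so each document keeps its class and metadata. A minimal sketch, assuming the tm package and a small in-memory corpus:

```r
library(tm)

docs <- VCorpus(VectorSource(c("First Document", "Second DOCUMENT")))

# Wrong: tolower() returns a plain character vector, so later steps
# such as TermDocumentMatrix() fail with the 'meta' error.
# docs_bad <- tm_map(docs, tolower)

# Right: content_transformer() wraps tolower so each element
# stays a PlainTextDocument with its metadata intact.
docs_ok <- tm_map(docs, content_transformer(tolower))

tdm <- TermDocumentMatrix(docs_ok)
inspect(tdm)
```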

Stock Tweets, Text Mining, Emoticon Errors

Posted by 徘徊边缘 on 2019-12-06 13:29:06
I was hoping you'd be able to assist with a text mining exercise. I was interested in 'AAPL' tweets and was able to pull 500 tweets from the API. I cleared several hurdles on my own, but need help with the last part. For some reason, the tm package is not removing stopwords. Can you please take a look and see what the problem might be? Are emoticons causing an issue? After plotting term frequency, the most frequent terms are "AAPL", "Apple", "iPhone", "Price", and "Stock". Thanks in advance! Munckinn

# transform into data frame
tweets.df <- twListToDF(tweets)
# isolate text from tweets
aapl_tweets…
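Two common causes fit the symptoms above: removeWords is case-sensitive (the built-in stopword lists are lowercase, so the text must be lowercased first), and non-ASCII emoticons can make the lowercasing step fail silently. A sketch of a cleaning pipeline, assuming the tm package and some made-up tweets:

```r
library(tm)

tweets <- c("Apple $AAPL stock is UP today \U0001F600",
            "I think the iPhone is the reason AAPL is up")

# Emoticons and other non-ASCII characters can break tolower()/tm_map(),
# so strip them first (iconv with sub = "" drops unconvertible chars).
clean <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

corp <- VCorpus(VectorSource(clean))
# stopwords("english") are all lowercase and removeWords() is
# case-sensitive, so lowercase BEFORE removing stopwords.
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

as.character(corp[[2]])
```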

Wordcloud showing colour based on continuous metadata in R

Posted by 半城伤御伤魂 on 2019-12-06 12:35:58
I'm creating a wordcloud in which the size of the words is based on frequency, but I want the colour of the words to be mapped to a third variable (stress, the amount of stress associated with each word; a numerical, continuous variable). I tried the following, which gave me only two different colours (yellow and purple), while I want something smoother. I would like a colour range, like a palette that goes from green to red, for example.

df = data.frame(word = c("calling", "meeting", "conference", "contract", "negotiation", "email"),
                n = c(20, 12, 4, 8, 10, 43),
                stress = c(23, 30…
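One way to get a smooth mapping is to build a fine-grained colour ramp and index it by the binned stress values, then tell wordcloud to use the colours in order. A sketch assuming the wordcloud package; the data frame mirrors the question's, with the truncated stress values filled in by invented numbers:

```r
library(wordcloud)

# Example data modelled on the question; stress values beyond the
# first two are invented for illustration.
df <- data.frame(word   = c("calling", "meeting", "conference",
                            "contract", "negotiation", "email"),
                 n      = c(20, 12, 4, 8, 10, 43),
                 stress = c(23, 30, 5, 12, 40, 18))

# Build a 100-step green-to-red ramp and index it by binned stress.
pal  <- colorRampPalette(c("green", "red"))(100)
cols <- pal[cut(df$stress, breaks = 100)]

# ordered.colors = TRUE pairs each word with its own colour instead of
# recycling the palette by frequency bucket.
wordcloud(df$word, df$n, colors = cols, ordered.colors = TRUE)
```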

Create a term frequency matrix using 2 columns from a csv file, in R?

Posted by 谁说我不能喝 on 2019-12-06 12:31:41
Question: I'm new to R. I'm mining data in a csv file: summaries of reports in one column, the date of the report in another, and the report's agency in the third column. I need to investigate how terms associated with 'fraud' have changed over time or vary by agency. I've filtered the rows containing the term 'fraud' and created a new csv file. How can I create a term frequency matrix with years as rows and terms as columns, so that I can look for top frequent terms and do some clustering? Basically,…
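One way to build such a matrix with base R alone: tokenise each summary, pair every token with its report's year, and cross-tabulate. A sketch with made-up data (the column names are assumptions):

```r
# Toy data standing in for the filtered csv (names are assumptions).
reports <- data.frame(
  year    = c(2001, 2001, 2002),
  summary = c("fraud case reported", "bank fraud suspected",
              "fraud investigation closed"),
  stringsAsFactors = FALSE
)

# Tokenise on whitespace after lowercasing.
toks <- strsplit(tolower(reports$summary), "\\s+")

# One row per (year, term) occurrence, then cross-tabulate:
# years become rows, terms become columns.
long <- data.frame(year = rep(reports$year, lengths(toks)),
                   term = unlist(toks))
tfm <- table(long$year, long$term)

tfm["2001", "fraud"]  # frequency of "fraud" in 2001 -> 2
```

From here, sort(colSums(tfm), decreasing = TRUE) gives the top terms, and the matrix can be fed to clustering functions such as dist() and hclust().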

How to keep the beginning and end of sentence markers with quanteda

Posted by a 夏天 on 2019-12-06 12:08:25
Question: I'm trying to create 3-grams using R's quanteda package. I'm struggling to find a way to keep the beginning- and end-of-sentence markers, the <s> and </s> in the code below, in the n-grams. I thought that using keptFeatures with a regular expression matching them would retain them, but the chevron markers are always removed. How can I keep the chevron markers from being removed, or what is the best way to delimit the beginning and end of a sentence with quanteda? As a bonus question, what is…
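One approach, sketched against a recent quanteda API: tokenise on whitespace with what = "fasterword", which bypasses the punctuation handling that strips < and >, and only then build the n-grams:

```r
library(quanteda)

txt <- "<s> the cat sat on the mat </s>"

# what = "fasterword" splits on whitespace only, so the <s> and </s>
# markers survive tokenisation instead of being stripped as punctuation.
toks <- tokens(txt, what = "fasterword")

# Build the 3-grams with the markers included.
ng <- tokens_ngrams(toks, n = 3)
as.character(ng)[1]   # first 3-gram: "<s>_the_cat"
```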

R Text mining - how to change texts in R data frame column into several columns with word frequencies?

Posted by 牧云@^-^@ on 2019-12-06 12:00:47
Question: I have a data frame with 4 columns. Column 1 contains IDs, column 2 texts (about 100 words each), and columns 3 and 4 labels. I would now like to retrieve the word frequencies (of the most common words) from the text column and add those frequencies as extra columns to the data frame. I would like the column names to be the words themselves, and the columns to be filled with the words' frequencies (ranging from 0 upward per text). I tried some functions of the tm package, but…
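This can be done in base R by cross-tabulating (id, word) pairs into a document-term matrix and binding it back onto the original data frame. A sketch with a toy data frame (column names are assumptions):

```r
# Toy data frame standing in for the real one (names are assumptions).
df <- data.frame(id    = 1:3,
                 text  = c("apple banana apple", "banana cherry",
                           "cherry cherry apple"),
                 label = c("a", "b", "a"),
                 stringsAsFactors = FALSE)

# Tokenise each text, build one (id, word) row per occurrence,
# cross-tabulate into a document-term matrix, and bind it back on.
toks <- strsplit(tolower(df$text), "\\s+")
long <- data.frame(id   = rep(df$id, lengths(toks)),
                   word = unlist(toks))
dtm  <- as.data.frame.matrix(table(long$id, long$word))

out <- cbind(df, dtm[match(df$id, rownames(dtm)), ])
out$apple   # per-text frequency of "apple": 2 0 1
```

To keep only the most common words, subset dtm by head(order(colSums(dtm), decreasing = TRUE), k) before the cbind.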

Does tm package itself provide a built-in way to combine document-term matrices?

Posted by 烂漫一生 on 2019-12-06 09:53:38
Question: Does the tm package itself provide a built-in way to combine document-term matrices? I generated 4 document-term matrices on the same corpus, one each for 1-, 2-, 3-, and 4-grams. They are all really big, about 200k x 10k, so converting them into data frames and then cbinding them is out of the question. I know I could write a program recording the non-zero elements in each of the matrices and build a sparse matrix, but that is a lot of trouble. It just seems natural for the tm package to provide this functionality. So if it…
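One avenue worth noting: tm stores its matrices as sparse simple_triplet_matrix objects from the slam package, and slam ships cbind/rbind methods for that class, so matrices over the same documents can be joined column-wise without ever densifying them. A sketch with tiny stand-in matrices (the real DTMs would need their documents in the same order):

```r
library(slam)

# Two tiny sparse matrices over the same 3 "documents";
# stand-ins for the 1-gram and 2-gram DTMs.
m1 <- as.simple_triplet_matrix(matrix(c(1, 0, 2, 0, 3, 0), nrow = 3,
                                      dimnames = list(1:3, c("a", "b"))))
m2 <- as.simple_triplet_matrix(matrix(c(0, 4, 0, 5, 0, 6), nrow = 3,
                                      dimnames = list(1:3, c("a_b", "b_c"))))

# Column-wise join: same documents, union of term columns,
# sparse throughout.
combined <- cbind(m1, m2)
dim(combined)   # 3 x 4
```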

SVM for Text Mining using scikit

Posted by 拜拜、爱过 on 2019-12-06 09:22:37
Can someone share a code snippet that shows how to use SVM for text mining with scikit? I have seen an example of SVM on numerical data, but am not quite sure how to deal with text. I looked at http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html but couldn't find SVM.

In text mining problems, text is represented by numeric values. Each feature represents a word, and the values are binary. That gives a matrix with lots of zeros and a few 1s, indicating which words occur in the text. Words can also be given weights according to their frequency…
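Putting the answer above into code, a minimal scikit-learn sketch: a vectorizer builds the sparse word-weight matrix just described, and LinearSVC is scikit-learn's linear SVM (the example texts and labels are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration.
texts = ["the stock price rose sharply",
         "shares fell after the earnings report",
         "the cat sat on the mat",
         "my dog chases the ball"]
labels = ["finance", "finance", "pets", "pets"]

# TfidfVectorizer builds the sparse term matrix (weighting words by
# frequency, as described above); LinearSVC is the linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["stock earnings fell"]))
```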

How to write output from RapidMiner to a txt file?

Posted by 流过昼夜 on 2019-12-06 05:38:26
I am using RapidMiner 5.3. I took a small document containing around three English sentences, tokenized it, and filtered it with respect to the length of the words. I want to write the output into a different Word document. I tried using the Write Document utility, but it is not working: it simply writes the same original document into the new one. However, when I write the output to the console, it gives me the expected answer, so something seems wrong with the Write Document utility.

Here is my process:

READ DOCUMENT --> TOKENIZE --> FILTER TOKENS --> WRITE DOCUMENT

Try the following: Cut Document (with (\S+)…