text-mining

Error using “TermDocumentMatrix” and “Dist” functions in R

Posted by 柔情痞子 on 2019-12-06 13:39:30
I have been trying to replicate the example here, but I have had some problems along the way. Everything worked fine until this step:

docsTDM <- TermDocumentMatrix(docs8)

which produced:

Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code

I was able to fix that error by modifying the previous step, changing this:

docs8 <- tm_map(docs7, tolower)

to this:

docs8 <- tm_map(docs7, content_transformer(tolower))

But then I got in…
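The fix above works because tm_map expects the mapped function to return a document object, while plain tolower returns a bare character vector, which is why TermDocumentMatrix later fails with the 'meta' error. content_transformer wraps such functions so each document keeps its class and metadata. A minimal sketch, assuming the tm package and a small in-memory corpus:

```r
library(tm)

docs <- VCorpus(VectorSource(c("First Document", "Second DOCUMENT")))

# Wrong: tolower() returns a plain character vector, so later steps
# such as TermDocumentMatrix() fail with the 'meta' error.
# docs_bad <- tm_map(docs, tolower)

# Right: content_transformer() wraps tolower so each element
# stays a PlainTextDocument with its metadata intact.
docs_ok <- tm_map(docs, content_transformer(tolower))

tdm <- TermDocumentMatrix(docs_ok)
inspect(tdm)
```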

Stock Tweets, Text Mining, Emoticon Errors

Posted by 徘徊边缘 on 2019-12-06 13:29:06
I was hoping you'd be able to assist with a text mining exercise. I was interested in 'AAPL' tweets and was able to pull 500 tweets from the API. I cleared several hurdles on my own, but need help with the last part. For some reason, the tm package is not removing stopwords. Can you please take a look and see what the problem might be? Are emoticons causing an issue? After plotting term frequency, the most frequent terms are "AAPL", "Apple", "iPhone", "Price", and "Stock". Thanks in advance! Munckinn

# transform into data frame
tweets.df <- twListToDF(tweets)
# isolate text from tweets
aapl_tweets…
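Two common causes fit the symptoms above: removeWords is case-sensitive (the built-in stopword lists are lowercase, so the text must be lowercased first), and non-ASCII emoticons can make the lowercasing step fail silently. A sketch of a cleaning pipeline, assuming the tm package and some made-up tweets:

```r
library(tm)

tweets <- c("Apple $AAPL stock is UP today \U0001F600",
            "I think the iPhone is the reason AAPL is up")

# Emoticons and other non-ASCII characters can break tolower()/tm_map(),
# so strip them first (iconv with sub = "" drops unconvertible chars).
clean <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

corp <- VCorpus(VectorSource(clean))
# stopwords("english") are all lowercase and removeWords() is
# case-sensitive, so lowercase BEFORE removing stopwords.
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

as.character(corp[[2]])
```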

Wordcloud showing colour based on continuous metadata in R

Posted by 半城伤御伤魂 on 2019-12-06 12:35:58
I'm creating a wordcloud in which the size of the words is based on frequency, but I want the colour of the words to be mapped to a third variable (stress, the amount of stress associated with each word; a numerical, continuous variable). I tried the following, which gave me only two different colours (yellow and purple), while I want something smoother. I would like a colour range, like a palette that goes from green to red, for example.

df = data.frame(word = c("calling", "meeting", "conference", "contract", "negotiation", "email"),
                n = c(20, 12, 4, 8, 10, 43),
                stress = c(23, 30…
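One way to get a smooth mapping is to build a fine-grained colour ramp and index it by the binned stress values, then tell wordcloud to use the colours in order. A sketch assuming the wordcloud package; the data frame mirrors the question's, with the truncated stress values filled in by invented numbers:

```r
library(wordcloud)

# Example data modelled on the question; stress values beyond the
# first two are invented for illustration.
df <- data.frame(word   = c("calling", "meeting", "conference",
                            "contract", "negotiation", "email"),
                 n      = c(20, 12, 4, 8, 10, 43),
                 stress = c(23, 30, 5, 12, 40, 18))

# Build a 100-step green-to-red ramp and index it by binned stress.
pal  <- colorRampPalette(c("green", "red"))(100)
cols <- pal[cut(df$stress, breaks = 100)]

# ordered.colors = TRUE pairs each word with its own colour instead of
# recycling the palette by frequency bucket.
wordcloud(df$word, df$n, colors = cols, ordered.colors = TRUE)
```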

Create a term frequency matrix using 2 columns from a csv file, in R?

Posted by 谁说我不能喝 on 2019-12-06 12:31:41
Question: I'm new to R. I'm mining data in a csv file: summaries of reports in one column, the date of the report in another, and the report's agency in the third column. I need to investigate how terms associated with 'fraud' have changed over time or vary by agency. I've filtered the rows containing the term 'fraud' and created a new csv file. How can I create a term frequency matrix with years as rows and terms as columns, so that I can look for top frequent terms and do some clustering? Basically,…
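One way to build such a matrix with base R alone: tokenise each summary, pair every token with its report's year, and cross-tabulate. A sketch with made-up data (the column names are assumptions):

```r
# Toy data standing in for the filtered csv (names are assumptions).
reports <- data.frame(
  year    = c(2001, 2001, 2002),
  summary = c("fraud case reported", "bank fraud suspected",
              "fraud investigation closed"),
  stringsAsFactors = FALSE
)

# Tokenise on whitespace after lowercasing.
toks <- strsplit(tolower(reports$summary), "\\s+")

# One row per (year, term) occurrence, then cross-tabulate:
# years become rows, terms become columns.
long <- data.frame(year = rep(reports$year, lengths(toks)),
                   term = unlist(toks))
tfm <- table(long$year, long$term)

tfm["2001", "fraud"]  # frequency of "fraud" in 2001 -> 2
```

From here, sort(colSums(tfm), decreasing = TRUE) gives the top terms, and the matrix can be fed to clustering functions such as dist() and hclust().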

How to keep the beginning and end of sentence markers with quanteda

Posted by a 夏天 on 2019-12-06 12:08:25
Question: I'm trying to create 3-grams using R's quanteda package. I'm struggling to find a way to keep the beginning- and end-of-sentence markers, the <s> and </s> in the code below, in the n-grams. I thought that using keptFeatures with a regular expression matching them would retain them, but the chevron markers are always removed. How can I keep the chevron markers from being removed, or what is the best way to delimit the beginning and end of a sentence with quanteda? As a bonus question, what is…
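One approach, sketched against a recent quanteda API: tokenise on whitespace with what = "fasterword", which bypasses the punctuation handling that strips < and >, and only then build the n-grams:

```r
library(quanteda)

txt <- "<s> the cat sat on the mat </s>"

# what = "fasterword" splits on whitespace only, so the <s> and </s>
# markers survive tokenisation instead of being stripped as punctuation.
toks <- tokens(txt, what = "fasterword")

# Build the 3-grams with the markers included.
ng <- tokens_ngrams(toks, n = 3)
as.character(ng)[1]   # first 3-gram: "<s>_the_cat"
```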

R Text mining - how to change texts in R data frame column into several columns with word frequencies?

Posted by 牧云@^-^@ on 2019-12-06 12:00:47
Question: I have a data frame with 4 columns. Column 1 contains IDs, column 2 texts (about 100 words each), and columns 3 and 4 labels. I would now like to retrieve the word frequencies (of the most common words) from the text column and add those frequencies as extra columns to the data frame. I would like the column names to be the words themselves, and the columns to be filled with the words' frequencies (ranging from 0 upward per text). I tried some functions of the tm package, but…
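This can be done in base R by cross-tabulating (id, word) pairs into a document-term matrix and binding it back onto the original data frame. A sketch with a toy data frame (column names are assumptions):

```r
# Toy data frame standing in for the real one (names are assumptions).
df <- data.frame(id    = 1:3,
                 text  = c("apple banana apple", "banana cherry",
                           "cherry cherry apple"),
                 label = c("a", "b", "a"),
                 stringsAsFactors = FALSE)

# Tokenise each text, build one (id, word) row per occurrence,
# cross-tabulate into a document-term matrix, and bind it back on.
toks <- strsplit(tolower(df$text), "\\s+")
long <- data.frame(id   = rep(df$id, lengths(toks)),
                   word = unlist(toks))
dtm  <- as.data.frame.matrix(table(long$id, long$word))

out <- cbind(df, dtm[match(df$id, rownames(dtm)), ])
out$apple   # per-text frequency of "apple": 2 0 1
```

To keep only the most common words, subset dtm by head(order(colSums(dtm), decreasing = TRUE), k) before the cbind.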

Does tm package itself provide a built-in way to combine document-term matrices?

Posted by 烂漫一生 on 2019-12-06 09:53:38
Question: Does the tm package itself provide a built-in way to combine document-term matrices? I generated 4 document-term matrices on the same corpus, one each for 1-, 2-, 3-, and 4-grams. They are all really big, about 200k x 10k, so converting them into data frames and then cbinding them is out of the question. I know I could write a program recording the non-zero elements in each of the matrices and build a sparse matrix, but that is a lot of trouble. It just seems natural for the tm package to provide this functionality. So if it…
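One avenue worth noting: tm stores its matrices as sparse simple_triplet_matrix objects from the slam package, and slam ships cbind/rbind methods for that class, so matrices over the same documents can be joined column-wise without ever densifying them. A sketch with tiny stand-in matrices (the real DTMs would need their documents in the same order):

```r
library(slam)

# Two tiny sparse matrices over the same 3 "documents";
# stand-ins for the 1-gram and 2-gram DTMs.
m1 <- as.simple_triplet_matrix(matrix(c(1, 0, 2, 0, 3, 0), nrow = 3,
                                      dimnames = list(1:3, c("a", "b"))))
m2 <- as.simple_triplet_matrix(matrix(c(0, 4, 0, 5, 0, 6), nrow = 3,
                                      dimnames = list(1:3, c("a_b", "b_c"))))

# Column-wise join: same documents, union of term columns,
# sparse throughout.
combined <- cbind(m1, m2)
dim(combined)   # 3 x 4
```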

SVM for Text Mining using scikit

Posted by 拜拜、爱过 on 2019-12-06 09:22:37
Can someone share a code snippet that shows how to use SVM for text mining with scikit? I have seen an example of SVM on numerical data, but am not quite sure how to deal with text. I looked at http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html but couldn't find SVM.

In text mining problems, text is represented by numeric values. Each feature represents a word, and the values are binary. That gives a matrix with lots of zeros and a few 1s, indicating which words occur in the text. Words can also be given weights according to their frequency…
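Putting the answer above into code, a minimal scikit-learn sketch: a vectorizer builds the sparse word-weight matrix just described, and LinearSVC is scikit-learn's linear SVM (the example texts and labels are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration.
texts = ["the stock price rose sharply",
         "shares fell after the earnings report",
         "the cat sat on the mat",
         "my dog chases the ball"]
labels = ["finance", "finance", "pets", "pets"]

# TfidfVectorizer builds the sparse term matrix (weighting words by
# frequency, as described above); LinearSVC is the linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["stock earnings fell"]))
```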

How to write output from RapidMiner to a txt file?

Posted by 流过昼夜 on 2019-12-06 05:38:26
I am using RapidMiner 5.3. I took a small document containing around three English sentences, tokenized it, and filtered it with respect to the length of the words. I want to write the output into a different Word document. I tried using the Write Document utility, but it is not working: it simply writes the same original document into the new one. However, when I write the output to the console, it gives me the expected answer, so something seems wrong with the Write Document utility.

Here is my process:

READ DOCUMENT --> TOKENIZE --> FILTER TOKENS --> WRITE DOCUMENT

Try the following: Cut Document (with (\S+)…