tm

R text mining: grouping similar words using stemDocuments in tm package

半世苍凉 submitted on 2020-04-18 06:10:15
Question: I am doing text mining on around 30000 tweets. To make the results more reliable, I want to convert "synonyms" to a single word. For example, some users write "girl", some "girls", some "gal". Similarly, "give" and "gave" mean the same thing, as do "come" and "came". Some users use short forms like "plz", "pls", etc. Also, stemDocument from the tm package is not working properly: it converts "dance" to "danc" and "table" to "tabl". Is there any other good package for stemming? I want to
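
A possible starting point for the synonym part, sketched in base R. Note that "danc" and "tabl" are what a Porter-style stemmer normally produces, so that output is expected rather than a bug; mapping variants onto one word needs a lookup table instead. The `synonyms` table and the `normalize_tokens` helper below are hypothetical, and the choice of "please" as the target for "plz"/"pls" is an assumption.

```r
# Hypothetical lookup table: names are the variants, values are the word to keep.
synonyms <- c(girls = "girl", gal = "girl", gave = "give",
              came = "come", plz = "please", pls = "please")

# Lower-case the text, split it into tokens, and replace any token found in the map.
normalize_tokens <- function(text, map) {
  tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
  tokens <- tokens[nzchar(tokens)]          # drop empty strings from leading separators
  hit <- unname(map[tokens])                # NA where a token has no entry in the map
  ifelse(is.na(hit), tokens, hit)
}

normalized <- normalize_tokens("Plz tell the gal she came late", synonyms)
print(normalized)
```

The table has to be built by hand (or from a lemma list), so this only scales as far as the list of variants you are willing to maintain.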

Matching a list of phrases to a corpus of documents and returning phrase frequency

我与影子孤独终老i submitted on 2020-02-27 12:04:24
Question: I have a list of phrases and a corpus of documents. There are 100k+ phrases and 60k+ documents in the corpus. The phrases may or may not be present in the corpus. I want to find the term frequency of each phrase present in the corpus. An example dataset: Phrases <- c("just starting", "several kilometers", "brief stroll", "gradually boost", "5 miles", "dark night", "cold morning") Doc1 <- "If you're just starting with workout, begin slow." Doc2 <- "Don't jump in brain initial and
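
One way to count phrase occurrences, sketched in base R with fixed-string matching. `Doc1` is taken from the question; the second document is made up for illustration because the original is cut off, and `count_phrase` is a hypothetical helper. This is a brute-force sketch, so it may be too slow as-is for 100k phrases against 60k documents.

```r
phrases <- c("just starting", "several kilometers", "brief stroll")

# Doc1 comes from the question; Doc2 is an invented stand-in.
docs <- c(Doc1 = "If you're just starting with workout, begin slow.",
          Doc2 = "A brief stroll is fine when you are just starting; another brief stroll helps.")
texts <- tolower(docs)

# Count literal (non-regex) occurrences of one phrase in every document.
count_phrase <- function(phrase, texts) {
  hits <- gregexpr(phrase, texts, fixed = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))  # gregexpr returns -1 when there is no match
}

# Rows are documents, columns are phrases.
counts <- sapply(phrases, count_phrase, texts = texts)
rownames(counts) <- names(docs)

total <- colSums(counts)  # corpus-wide frequency per phrase
print(counts)
print(total)
```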

finding key phrases using tm package in r

南楼画角 submitted on 2020-02-25 03:06:48
Question: I have a project that requires me to search the annual reports of various companies and find key phrases in them. I have converted the reports to text files, then created and cleaned a corpus. I then created a document-term matrix. The tm_term_score function only seems to work for single words, not phrases. Is it possible to search the corpus for key phrases (not necessarily the most frequent)? For example, I want to see how many times the phrase "supply chain finance" appears in each document in the corpus.

tm: read in data frame, keep text id's, construct DTM and join to other dataset

[亡魂溺海] submitted on 2020-01-29 02:30:24
Question: I'm using the tm package. Say I have a data frame with 2 columns and 500 rows. The first column is an ID, which is randomly generated and contains both letters and digits: "txF87uyK". The second column is the actual text: "Today's weather is good. John went jogging. blah, blah,..." Now I want to create a document-term matrix from this data frame. My problem is that I want to keep the ID information so that, after I get the document-term matrix, I can join this matrix with another matrix that has each row being
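
A dependency-free sketch of the idea, in base R rather than tm: keep the ID as the row key of the term-count table, then `merge()` on it. The second ID, the second text, and the `other` data frame with its `score` column are all invented for illustration. With tm itself, recent versions of `DataframeSource` appear to expect columns named `doc_id` and `text`, which would carry the ID through to the matrix row names, but check that against your installed version.

```r
docs <- data.frame(
  ID   = c("txF87uyK", "b2Lm90Qa"),   # second ID is hypothetical
  text = c("Today's weather is good. John went jogging.",
           "John likes good weather."),
  stringsAsFactors = FALSE
)

# Tokenize each text and keep track of which ID every token came from.
tokens <- strsplit(tolower(docs$text), "[^a-z']+")
long <- data.frame(ID   = rep(docs$ID, lengths(tokens)),
                   term = unlist(tokens),
                   stringsAsFactors = FALSE)
long <- long[nzchar(long$term), ]

# Document-by-term counts, with the IDs as row names.
dtm <- as.data.frame.matrix(table(long$ID, long$term))
dtm$ID <- rownames(dtm)

# Hypothetical second dataset keyed by the same IDs.
other <- data.frame(ID = c("txF87uyK", "b2Lm90Qa"), score = c(0.7, 0.2),
                    stringsAsFactors = FALSE)

joined <- merge(dtm, other, by = "ID")
print(joined)
```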

tm: read in data frame, keep text id's, construct DTM and join to other dataset

醉酒当歌 submitted on 2020-01-29 02:30:06
Question: I'm using the tm package. Say I have a data frame with 2 columns and 500 rows. The first column is an ID, which is randomly generated and contains both letters and digits: "txF87uyK". The second column is the actual text: "Today's weather is good. John went jogging. blah, blah,..." Now I want to create a document-term matrix from this data frame. My problem is that I want to keep the ID information so that, after I get the document-term matrix, I can join this matrix with another matrix that has each row being

stemDocument R text mining

牧云@^-^@ submitted on 2020-01-15 05:44:07
Question: My data is a txt file and looks as follows: words number_doc overwiew 1 client 1 store 1 marge 1 price 2 stock 2 economics 2 The document numbers are sorted (from smallest to largest). Now I want, for each document, all the words that belong to it. Currently they are in a column, but I want all the words in a textDocument (from the tm package, because it is necessary for some functions in that package). I did this as follows: data <- read.table("poging.txt", header = TRUE)
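
A minimal sketch of the grouping step in base R, assuming the file has exactly the two columns shown. The sample rows are read from an inline string here instead of "poging.txt" so the snippet runs on its own; the data (including the spelling "overwiew") is copied from the question.

```r
# Inline copy of the sample data; with the real file this would be
# read.table("poging.txt", header = TRUE, stringsAsFactors = FALSE).
data <- read.table(
  text = "words number_doc
overwiew 1
client 1
store 1
marge 1
price 2
stock 2
economics 2",
  header = TRUE, stringsAsFactors = FALSE
)

# Split the words by document number and paste each group into one string.
doc_texts <- vapply(split(data$words, data$number_doc),
                    paste, character(1), collapse = " ")
print(doc_texts)

# One string per document can then be handed to tm, for example:
# corpus <- tm::VCorpus(tm::VectorSource(unname(doc_texts)))
```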

negation handling in R, how can I replace a word following a negation in R?

谁说我不能喝 submitted on 2020-01-13 20:23:10
Question: I'm doing sentiment analysis for financial articles. To enhance the accuracy of my naive Bayes classifier, I'd like to implement negation handling. Specifically, I want to add the prefix "not_" to the word following a "not" or "n't". So if there's something like this in my corpus: x <- "They didn't sell the company." I want to get the following: "they didn't not_sell the company." (the stopword "didn't" will be removed later). I could only find the gsub() function, but it doesn't seem to work
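
For what it's worth, `gsub()` can do this with capture groups. A minimal sketch, assuming straight apostrophes and that only the single word directly after the negation should be tagged; `add_negation_prefix` is a hypothetical helper name.

```r
# Prefix the word that follows "not" or "n't" with "not_".
# Group 1 captures the negation, group 2 the next word.
add_negation_prefix <- function(text) {
  gsub("(\\bnot|n't)\\s+(\\w+)", "\\1 not_\\2", tolower(text), perl = TRUE)
}

x <- "They didn't sell the company."
result <- add_negation_prefix(x)
print(result)
```

The `\\b` before `not` keeps words such as "knot" from matching, and requiring whitespace after the negation keeps "nothing" from matching.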

negation handling in R, how can I replace a word following a negation in R?

不羁的心 submitted on 2020-01-13 20:23:07
Question: I'm doing sentiment analysis for financial articles. To enhance the accuracy of my naive Bayes classifier, I'd like to implement negation handling. Specifically, I want to add the prefix "not_" to the word following a "not" or "n't". So if there's something like this in my corpus: x <- "They didn't sell the company." I want to get the following: "they didn't not_sell the company." (the stopword "didn't" will be removed later). I could only find the gsub() function, but it doesn't seem to work

Frequency Per Term - R TM DocumentTermMatrix

ⅰ亾dé卋堺 submitted on 2020-01-13 11:33:15
Question: I'm very new to R and cannot quite wrap my head around DocumentTermMatrix objects. I have a DocumentTermMatrix created with the tm package. It has the terms and their frequencies inside it, but I cannot figure out how to access them. Ideally, I would like: Term # "the" 200 "is" 400 "a" 200 Currently my code is: library(tm) common.words <- c("amp","@RT","I","http","https", stopwords("english"), "you") x <- Corpus(VectorSource(results)) x <- tm_map(x, stripWhitespace) x <- tm_map(x, removeNumbers) x <
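
A sketch of the usual approach: convert the matrix to a plain one and sum the columns. The small matrix `m` below is hand-built with made-up counts to stand in for `as.matrix(dtm)`, so the snippet runs without tm; with a real DocumentTermMatrix you would start from `m <- as.matrix(dtm)`.

```r
# Stand-in for as.matrix(dtm): rows are documents, columns are terms.
m <- matrix(c(2L, 1L, 0L,
              1L, 3L, 1L),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("1", "2"), Terms = c("the", "is", "a")))

# Total frequency per term across all documents, most frequent first.
freq <- sort(colSums(m), decreasing = TRUE)

# Two-column layout similar to the one asked for.
freq_table <- data.frame(Term = names(freq), Count = unname(freq),
                         stringsAsFactors = FALSE)
print(freq_table)
```

For a large corpus the dense conversion can use a lot of memory; `slam::col_sums()` on the sparse matrix is one alternative.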

Lemmatization using txt file with lemmes in R

↘锁芯ラ submitted on 2020-01-13 06:42:25
Question: I would like to use an external txt file with Polish lemmas, structured as follows (source of lemma lists for many other languages: http://www.lexiconista.com/datasets/lemmatization/): Abadan Abadanem Abadan Abadanie Abadan Abadanowi Abadan Abadanu abadańczyk abadańczycy abadańczyk abadańczyka abadańczyk abadańczykach abadańczyk abadańczykami abadańczyk abadańczyki abadańczyk abadańczykiem abadańczyk abadańczykom abadańczyk abadańczyków abadańczyk abadańczykowi abadańczyk abadańczyku abadanka abadance
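
A minimal lookup-table sketch in base R. It assumes each line of the file is a tab-separated pair with the lemma first and the inflected form second, as the sample suggests. Three pairs from the question are inlined so the snippet runs on its own; `lemmatize` is a hypothetical helper and "Teheran" is an invented token used to show the fallback.

```r
# Inline stand-in for the lemma file; for the real file, something like
# read.delim("path/to/file.txt", header = FALSE, encoding = "UTF-8", ...) is assumed.
lemma_lines <- c("Abadan\tAbadanem", "Abadan\tAbadanie", "Abadan\tAbadanowi")
lemmas <- read.delim(text = lemma_lines, header = FALSE,
                     col.names = c("lemma", "form"),
                     quote = "", stringsAsFactors = FALSE)

# Named vector: look up by inflected form, get the lemma back.
lookup <- setNames(lemmas$lemma, lemmas$form)

# Replace known forms with their lemma; leave unknown tokens unchanged.
lemmatize <- function(tokens, lookup) {
  hit <- unname(lookup[tokens])
  ifelse(is.na(hit), tokens, hit)
}

lemmatized <- lemmatize(c("Abadanem", "Abadanowi", "Teheran"), lookup)
print(lemmatized)
```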