text-mining

R text mining - how to turn the texts in an R data frame column into several columns with word frequencies?

£可爱£侵袭症+ submitted on 2019-12-04 15:19:20
I have a data frame with 4 columns. Column 1 contains IDs, column 2 contains texts (about 100 words each), and columns 3 and 4 contain labels. I would like to retrieve word frequencies (of the most common words) from the texts column and add those frequencies as extra columns to the data frame. The column names should be the words themselves, and each column should hold that word's frequency (ranging from 0 upward) in each text. I tried some functions of the tm package, but the results so far have been unsatisfactory. Does anyone have an idea how to deal with this problem, or where to start? …
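A minimal sketch of one way to do this with tm, assuming a data frame named df with a texts column (both names are placeholders): build a corpus, compute a document-term matrix, keep only the most frequent terms, and bind the counts back onto the data frame as columns.

    library(tm)

    # Hypothetical input mirroring the question: id, texts, plus label columns
    df <- data.frame(
      id = 1:3,
      texts = c("good stock buy buy",
                "markets crash today",
                "buy good products"),
      stringsAsFactors = FALSE
    )

    corpus <- VCorpus(VectorSource(df$texts))
    dtm <- DocumentTermMatrix(corpus)            # rows = texts, columns = words

    # Keep only the most common words, e.g. those appearing at least twice overall
    freq_terms <- findFreqTerms(dtm, lowfreq = 2)
    counts <- as.matrix(dtm[, freq_terms])

    df <- cbind(df, counts)                      # one frequency column per word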

URL path similarity/string similarity algorithm

為{幸葍}努か submitted on 2019-12-04 14:49:55
My problem is that I need to compare URL paths and deduce whether they are similar. Below is example data to process:

    # GROUP 1
    /robots.txt

    # GROUP 2
    /bot.html

    # GROUP 3
    /phpMyAdmin-2.5.6-rc1/scripts/setup.php
    /phpMyAdmin-2.5.6-rc2/scripts/setup.php
    /phpMyAdmin-2.5.6/scripts/setup.php
    /phpMyAdmin-2.5.7-pl1/scripts/setup.php
    /phpMyAdmin-2.5.7/scripts/setup.php
    /phpMyAdmin-2.6.0-alpha/scripts/setup.php
    /phpMyAdmin-2.6.0-alpha2/scripts/setup.php

    # GROUP 4
    //phpMyAdmin/

I tried Levenshtein distance for the comparison, but it is not accurate enough for me. I do not need a 100% accurate algorithm, but I think …
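One hedged alternative to character-level Levenshtein is to compare paths segment by segment, so that a differing version number counts as one mismatched segment rather than many mismatched characters. A sketch in base R (the Jaccard measure and the example are assumptions, not from the question):

    # Split a URL path into its segments, ignoring empty pieces from "//"
    path_segments <- function(p) {
      segs <- unlist(strsplit(p, "/", fixed = TRUE))
      segs[nzchar(segs)]
    }

    # Jaccard similarity of the two segment sets
    path_similarity <- function(a, b) {
      sa <- path_segments(a)
      sb <- path_segments(b)
      length(intersect(sa, sb)) / length(union(sa, sb))
    }

    path_similarity("/phpMyAdmin-2.5.6/scripts/setup.php",
                    "/phpMyAdmin-2.5.7/scripts/setup.php")   # 0.5

To also credit near-identical segments such as phpMyAdmin-2.5.6 vs phpMyAdmin-2.5.7, one could additionally compare segments pairwise with base R's adist() and count pairs below a small edit-distance threshold.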

R Tidytext and unnest_tokens error

久未见 submitted on 2019-12-04 14:40:04
Very new to R, and I have started to use the tidytext package. I'm trying to pass arguments into the unnest_tokens function so I can run the analysis over multiple columns. So instead of this:

    library(janeaustenr)
    library(tidytext)
    library(dplyr)
    library(stringr)

    original_books <- austen_books() %>%
      group_by(book) %>%
      mutate(linenumber = row_number(),
             chapter = cumsum(str_detect(text,
                                         regex("^chapter [\\divxlc]",
                                               ignore_case = TRUE)))) %>%
      ungroup()

    original_books

    tidy_books <- original_books %>%
      unnest_tokens(word, text)

the last lines of code would be:

    output <- 'word'
    input <- 'text'
    tidy_books <- …
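A sketch of one way to pass column names stored as strings to unnest_tokens, using rlang's sym() plus !! unquoting, which tidytext's tidy-eval interface supports (original_books is the data frame built in the question's code):

    library(dplyr)
    library(tidytext)
    library(rlang)

    output <- "word"
    input  <- "text"

    # Convert the strings to symbols and unquote them inside the call
    tidy_books <- original_books %>%
      unnest_tokens(!!sym(output), !!sym(input))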

Keeping Track of Word Proximity

女生的网名这么多〃 submitted on 2019-12-04 13:55:10
Question: I am working on a small project that involves dictionary-based text searching within a collection of documents. My dictionary holds positive signal words (a.k.a. good words), but in the document collection merely finding such a word does not guarantee a positive result, because negative words (for example "not", "not significant") may appear in the proximity of the positive words. I want to construct a matrix containing the document number, the positive word, and its proximity to negative …
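A minimal sketch of one way to build such a matrix in base R, with placeholder word lists; proximity here is the smallest distance, in tokens, from each positive-word occurrence to any negative word:

    positive_words <- c("significant", "good")   # placeholder dictionary
    negative_words <- c("not", "no")             # placeholder negators

    doc <- "the result was not significant but the design was good"
    tokens <- unlist(strsplit(tolower(doc), "\\s+"))

    pos_idx <- which(tokens %in% positive_words)
    neg_idx <- which(tokens %in% negative_words)

    # One row per positive-word occurrence: document id, the word, and its
    # token distance to the nearest negative word (Inf if none occurs)
    proximity <- data.frame(
      doc  = 1,
      word = tokens[pos_idx],
      dist = sapply(pos_idx, function(i)
        if (length(neg_idx)) min(abs(neg_idx - i)) else Inf)
    )
    proximity
    #   doc        word dist
    # 1   1 significant    1
    # 2   1        good    6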

Arabic text mining using R [closed]

六眼飞鱼酱① submitted on 2019-12-04 13:17:12
Question: Closed. This question needs to be more focused and is not currently accepting answers. Closed 5 years ago. I am a new user and I just want to get help with my work in R. I am doing Arabic text mining and would love help from anyone with experience in this field. So far I have failed to normalize the Arabic text, and R does not even print Arabic characters in the console. I am …
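Console display problems like this usually come down to character encodings. A hedged starting point, not from the question (the file name is a placeholder, and behaviour also depends on the OS locale):

    # Read the text while declaring UTF-8 so Arabic characters survive import
    txt <- readLines("arabic_corpus.txt", encoding = "UTF-8")

    # Check what encoding R believes the strings have
    Encoding(txt)

    # Strip Arabic diacritics (tashkeel) as a simple normalization step;
    # U+064B-U+065F covers the common combining marks
    txt_norm <- gsub("[\u064B-\u065F]", "", txt)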

R: Naive Bayes classifier bases decision only on a priori probabilities

余生颓废 submitted on 2019-12-04 09:55:27
Question: I'm trying to classify tweets by sentiment into three categories (Buy, Hold, Sell). I'm using R and the package e1071. I have two data frames: one training set and one set of new tweets whose sentiment needs to be predicted.

trainingset data frame:

    text                                 | sentiment
    -------------------------------------+----------
    this stock is a good buy             | Buy
    markets crash in tokyo               | Sell
    everybody excited about new products | Hold
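A frequent cause of Naive Bayes predictions that just follow the class priors is handing e1071::naiveBayes raw term counts, which it then models as Gaussian variables with almost no variance. A hedged sketch of the usual workaround, recoding a document-term matrix as categorical presence/absence features before training:

    library(tm)
    library(e1071)

    # Placeholder training data mirroring the question's table
    trainingset <- data.frame(
      text = c("this stock is a good buy",
               "markets crash in tokyo",
               "everybody excited about new products"),
      sentiment = factor(c("Buy", "Sell", "Hold")),
      stringsAsFactors = FALSE
    )

    dtm <- DocumentTermMatrix(VCorpus(VectorSource(trainingset$text)))

    # Recode counts as "No"/"Yes" factors so naiveBayes treats them as categorical
    to_factor <- function(x) factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))
    features  <- as.data.frame(lapply(as.data.frame(as.matrix(dtm)), to_factor))

    model <- naiveBayes(features, trainingset$sentiment, laplace = 1)
    predict(model, features)   # sanity check on the training data itself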

C# Sentiment Analysis [closed]

时光怂恿深爱的人放手 submitted on 2019-12-04 08:44:03
Question: Closed. This question is off-topic and not currently accepting answers. Closed 2 years ago. Does anyone know of a (preferably open source) C# library that can be used to calculate the overall sentiment of a given text?

Answer 1: Take a look at an open-source sentiment analysis engine based on Naive Bayes classification at https://github.com/amrishdeep/Dragon.

Answer 2: http://rapid-i.com/content/view

How to use OpenNLP to get POS tags in R?

 ̄綄美尐妖づ submitted on 2019-12-04 08:40:37
Here is the R code:

    library(NLP)
    library(openNLP)

    tagPOS <- function(x, ...) {
      s <- as.String(x)
      word_token_annotator <- Maxent_Word_Token_Annotator()
      a2 <- Annotation(1L, "sentence", 1L, nchar(s))
      a2 <- annotate(s, word_token_annotator, a2)
      a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
      a3w <- a3[a3$type == "word"]
      POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
      POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
      list(POStagged = POStagged, POStags = POStags)
    }

    str <- "this is a the first sentence."
    tagged_str <- tagPOS(str)

Output is:

    tagged_str
    $POStagged
    [1] "this …
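A small follow-on sketch, assuming the tagPOS function and the openNLP models from the code above: filtering the word/tag pairs to keep only tokens tagged as nouns.

    res   <- tagPOS("this is a the first sentence.")
    pairs <- unlist(strsplit(res$POStagged, " "))
    nouns <- pairs[grepl("/NN", pairs)]   # NN, NNS, NNP, NNPS all match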

Sentiment Analysis on LARGE collection of online conversation text

南楼画角 submitted on 2019-12-04 08:34:21
Question: The title says it all; I have an SQL database bursting at the seams with online conversation text. I've already done most of this project in Python, so I would like to use Python's NLTK library (unless there's a strong reason not to). The data is organized by Thread, Username, and Post. Each thread more or less focuses on discussing one "product" of the Category that I am interested in analyzing. Ultimately, when this is finished, I would like to have an estimated opinion (like …