text-mining

R text mining - how to turn the texts in an R data frame column into several columns with word frequencies?

£可爱£侵袭症+ submitted on 2019-12-04 15:19:20
I have a data frame with 4 columns. Column 1 contains IDs, column 2 contains texts (about 100 words each), and columns 3 and 4 contain labels. I would like to retrieve word frequencies (of the most common words) from the texts column and add those frequencies as extra columns to the data frame. The column names should be the words themselves, and each column should hold that word's frequency (ranging from 0 upward) in each text. I tried some functions of the tm package, but the results so far have been unsatisfactory. Does anyone have an idea how to deal with this problem, or where to start? …
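A minimal sketch of one way to do this with tm, assuming a data frame named df with a texts column (both names are placeholders): build a corpus, compute a document-term matrix, keep only the most frequent terms, and bind the counts back onto the data frame as columns.

    library(tm)

    # Hypothetical input mirroring the question: id, texts, plus label columns
    df <- data.frame(
      id = 1:3,
      texts = c("good stock buy buy",
                "markets crash today",
                "buy good products"),
      stringsAsFactors = FALSE
    )

    corpus <- VCorpus(VectorSource(df$texts))
    dtm <- DocumentTermMatrix(corpus)            # rows = texts, columns = words

    # Keep only the most common words, e.g. those appearing at least twice overall
    freq_terms <- findFreqTerms(dtm, lowfreq = 2)
    counts <- as.matrix(dtm[, freq_terms])

    df <- cbind(df, counts)                      # one frequency column per word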

URL path similarity/string similarity algorithm

為{幸葍}努か submitted on 2019-12-04 14:49:55
My problem is that I need to compare URL paths and deduce whether they are similar. Below is example data to process:

    # GROUP 1
    /robots.txt

    # GROUP 2
    /bot.html

    # GROUP 3
    /phpMyAdmin-2.5.6-rc1/scripts/setup.php
    /phpMyAdmin-2.5.6-rc2/scripts/setup.php
    /phpMyAdmin-2.5.6/scripts/setup.php
    /phpMyAdmin-2.5.7-pl1/scripts/setup.php
    /phpMyAdmin-2.5.7/scripts/setup.php
    /phpMyAdmin-2.6.0-alpha/scripts/setup.php
    /phpMyAdmin-2.6.0-alpha2/scripts/setup.php

    # GROUP 4
    //phpMyAdmin/

I tried Levenshtein distance for the comparison, but it is not accurate enough for me. I do not need a 100% accurate algorithm, but I think …
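One hedged alternative to character-level Levenshtein is to compare paths segment by segment, so that a differing version number counts as one mismatched segment rather than many mismatched characters. A sketch in base R (the Jaccard measure and the example are assumptions, not from the question):

    # Split a URL path into its segments, ignoring empty pieces from "//"
    path_segments <- function(p) {
      segs <- unlist(strsplit(p, "/", fixed = TRUE))
      segs[nzchar(segs)]
    }

    # Jaccard similarity of the two segment sets
    path_similarity <- function(a, b) {
      sa <- path_segments(a)
      sb <- path_segments(b)
      length(intersect(sa, sb)) / length(union(sa, sb))
    }

    path_similarity("/phpMyAdmin-2.5.6/scripts/setup.php",
                    "/phpMyAdmin-2.5.7/scripts/setup.php")   # 0.5

To also credit near-identical segments such as phpMyAdmin-2.5.6 vs phpMyAdmin-2.5.7, one could additionally compare segments pairwise with base R's adist() and count pairs below a small edit-distance threshold.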

R Tidytext and unnest_tokens error

久未见 submitted on 2019-12-04 14:40:04
Very new to R, and I have started to use the tidytext package. I'm trying to pass arguments into the unnest_tokens function so I can run the analysis over multiple columns. So instead of this:

    library(janeaustenr)
    library(tidytext)
    library(dplyr)
    library(stringr)

    original_books <- austen_books() %>%
      group_by(book) %>%
      mutate(linenumber = row_number(),
             chapter = cumsum(str_detect(text,
                                         regex("^chapter [\\divxlc]",
                                               ignore_case = TRUE)))) %>%
      ungroup()

    original_books

    tidy_books <- original_books %>%
      unnest_tokens(word, text)

the last lines of code would be:

    output <- 'word'
    input <- 'text'
    tidy_books <- …
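A sketch of one way to pass column names stored as strings to unnest_tokens, using rlang's sym() plus !! unquoting, which tidytext's tidy-eval interface supports (original_books is the data frame built in the question's code):

    library(dplyr)
    library(tidytext)
    library(rlang)

    output <- "word"
    input  <- "text"

    # Convert the strings to symbols and unquote them inside the call
    tidy_books <- original_books %>%
      unnest_tokens(!!sym(output), !!sym(input))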

Keeping Track of Word Proximity

女生的网名这么多〃 submitted on 2019-12-04 13:55:10
Question: I am working on a small project that involves dictionary-based text searching within a collection of documents. My dictionary holds positive signal words (a.k.a. good words), but in the document collection merely finding such a word does not guarantee a positive result, because negative words (for example "not", "not significant") may appear in the proximity of the positive words. I want to construct a matrix containing the document number, the positive word, and its proximity to negative …
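A minimal sketch of one way to build such a matrix in base R, with placeholder word lists; proximity here is the smallest distance, in tokens, from each positive-word occurrence to any negative word:

    positive_words <- c("significant", "good")   # placeholder dictionary
    negative_words <- c("not", "no")             # placeholder negators

    doc <- "the result was not significant but the design was good"
    tokens <- unlist(strsplit(tolower(doc), "\\s+"))

    pos_idx <- which(tokens %in% positive_words)
    neg_idx <- which(tokens %in% negative_words)

    # One row per positive-word occurrence: document id, the word, and its
    # token distance to the nearest negative word (Inf if none occurs)
    proximity <- data.frame(
      doc  = 1,
      word = tokens[pos_idx],
      dist = sapply(pos_idx, function(i)
        if (length(neg_idx)) min(abs(neg_idx - i)) else Inf)
    )
    proximity
    #   doc        word dist
    # 1   1 significant    1
    # 2   1        good    6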

Arabic text mining using R [closed]

六眼飞鱼酱① submitted on 2019-12-04 13:17:12
Question: Closed. This question needs to be more focused and is not currently accepting answers. Closed 5 years ago. I am a new user and I just want to get help with my work in R. I am doing Arabic text mining and would love help from anyone with experience in this field. So far I have failed to normalize the Arabic text, and R does not even print Arabic characters in the console. I am …
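Console display problems like this usually come down to character encodings. A hedged starting point, not from the question (the file name is a placeholder, and behaviour also depends on the OS locale):

    # Read the text while declaring UTF-8 so Arabic characters survive import
    txt <- readLines("arabic_corpus.txt", encoding = "UTF-8")

    # Check what encoding R believes the strings have
    Encoding(txt)

    # Strip Arabic diacritics (tashkeel) as a simple normalization step;
    # U+064B-U+065F covers the common combining marks
    txt_norm <- gsub("[\u064B-\u065F]", "", txt)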

R: Naive Bayes classifier bases decision only on a priori probabilities

余生颓废 submitted on 2019-12-04 09:55:27
Question: I'm trying to classify tweets by sentiment into three categories (Buy, Hold, Sell). I'm using R and the package e1071. I have two data frames: one training set and one set of new tweets whose sentiment needs to be predicted.

trainingset data frame:

    text                                 | sentiment
    -------------------------------------+----------
    this stock is a good buy             | Buy
    markets crash in tokyo               | Sell
    everybody excited about new products | Hold
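A frequent cause of Naive Bayes predictions that just follow the class priors is handing e1071::naiveBayes raw term counts, which it then models as Gaussian variables with almost no variance. A hedged sketch of the usual workaround, recoding a document-term matrix as categorical presence/absence features before training:

    library(tm)
    library(e1071)

    # Placeholder training data mirroring the question's table
    trainingset <- data.frame(
      text = c("this stock is a good buy",
               "markets crash in tokyo",
               "everybody excited about new products"),
      sentiment = factor(c("Buy", "Sell", "Hold")),
      stringsAsFactors = FALSE
    )

    dtm <- DocumentTermMatrix(VCorpus(VectorSource(trainingset$text)))

    # Recode counts as "No"/"Yes" factors so naiveBayes treats them as categorical
    to_factor <- function(x) factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))
    features  <- as.data.frame(lapply(as.data.frame(as.matrix(dtm)), to_factor))

    model <- naiveBayes(features, trainingset$sentiment, laplace = 1)
    predict(model, features)   # sanity check on the training data itself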

C# Sentiment Analysis [closed]

时光怂恿深爱的人放手 submitted on 2019-12-04 08:44:03
Question: Closed. This question is off-topic and not currently accepting answers. Closed 2 years ago. Does anyone know of a (preferably open source) C# library that can be used to calculate the overall sentiment of a given text?

Answer 1: Take a look at an open-source sentiment analysis engine based on Naive Bayes classification at https://github.com/amrishdeep/Dragon.

Answer 2: http://rapid-i.com/content/view

How to use OpenNLP to get POS tags in R?

 ̄綄美尐妖づ submitted on 2019-12-04 08:40:37
Here is the R code:

    library(NLP)
    library(openNLP)

    tagPOS <- function(x, ...) {
      s <- as.String(x)
      word_token_annotator <- Maxent_Word_Token_Annotator()
      a2 <- Annotation(1L, "sentence", 1L, nchar(s))
      a2 <- annotate(s, word_token_annotator, a2)
      a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
      a3w <- a3[a3$type == "word"]
      POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
      POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
      list(POStagged = POStagged, POStags = POStags)
    }

    str <- "this is a the first sentence."
    tagged_str <- tagPOS(str)

Output is:

    tagged_str
    $POStagged
    [1] "this …
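A small follow-on sketch, assuming the tagPOS function and the openNLP models from the code above: filtering the word/tag pairs to keep only tokens tagged as nouns.

    res   <- tagPOS("this is a the first sentence.")
    pairs <- unlist(strsplit(res$POStagged, " "))
    nouns <- pairs[grepl("/NN", pairs)]   # NN, NNS, NNP, NNPS all match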

Sentiment Analysis on LARGE collection of online conversation text

南楼画角 submitted on 2019-12-04 08:34:21
Question: The title says it all; I have an SQL database bursting at the seams with online conversation text. I've already done most of this project in Python, so I would like to use Python's NLTK library (unless there's a strong reason not to). The data is organized by Thread, Username, and Post. Each thread more or less focuses on discussing one "product" of the Category that I am interested in analyzing. Ultimately, when this is finished, I would like to have an estimated opinion (like …