text-mining

How to create a good NER training model in OpenNLP?

Submitted by 一笑奈何 on 2019-11-27 01:56:46
Question: I have just started with OpenNLP. I need to create a simple training model to recognize named entities. Reading the docs here https://opennlp.apache.org/docs/1.8.0/apidocs/opennlp-tools/opennlp/tools/namefind I see this simple text for training the model: <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 . Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group . <START:person> Rudolph Agnew <END> , 55 years
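For reference, the name finder's training format is one sentence per line, with tokens separated by whitespace and each entity wrapped in <START:type> ... <END> tags. A minimal sketch of such a file (sentences shortened for illustration):

    <START:person> Pierre Vinken <END> , 61 years old , will join the board .
    Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. .

The OpenNLP manual recommends training on a large corpus (at least around 15,000 sentences) to get a model that performs well; a handful of examples will not produce a good model.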

Finding 2 & 3 word Phrases Using R TM Package

Submitted by 风流意气都作罢 on 2019-11-27 00:10:17
Question: I am trying to find code that actually works to find the most frequently used two- and three-word phrases with the R text mining package tm (maybe there is another package for it that I do not know). I have been trying to use the tokenizer, but seem to have no luck. If you have worked on a similar situation in the past, could you post code that is tested and actually works? Thank you so much! Answer: You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the package tau installed it's fairly straightforward. library(tm); library(tau); tokenize_ngrams <- function(x, n=3)
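A fuller sketch of that approach, assuming the tm and tau packages are installed (the sample corpus and the lowfreq value are invented for illustration):

    library(tm)
    library(tau)

    # tokenize each document into word n-grams via tau::textcnt (method "string")
    tokenize_ngrams <- function(x, n = 3) {
      rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n))))
    }

    corpus <- Corpus(VectorSource(c("the quick brown fox", "the quick red fox")))
    dtm <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))
    findFreqTerms(dtm, lowfreq = 2)  # phrases that appear in at least 2 documents

An alternative worth knowing: RWeka::NGramTokenizer with Weka_control(min = 2, max = 3) can be passed the same way and yields exactly the two- and three-word phrases.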

Detect text language in R [closed]

Submitted by 杀马特。学长 韩版系。学妹 on 2019-11-26 22:36:04
Question: In R I have a list of tweets and I would like to keep only those that are in English. I am wondering if any of you know of an R package that provides a simple way to identify the language of a string. Cheers, z Answer 1: The textcat package does this. It can detect 74 'languages' (more properly, language/encoding combinations), and more with other extensions. Details and examples are in this freely available article: Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. The textcat
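A minimal sketch of the textcat approach (the sample strings are invented; note that very short texts such as tweets are easy to misclassify):

    library(textcat)

    tweets <- c("Hello world, how are you?",
                "Bonjour tout le monde",
                "Hola a todos")
    langs <- textcat(tweets)      # e.g. "english", "french", "spanish"
    tweets[langs == "english"]   # keep only the English tweets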

list of word frequencies using R

Submitted by 旧巷老猫 on 2019-11-26 22:21:50
Question: I have been using the tm package to run some text analysis. My problem is with creating a list of words and their associated frequencies:

    library(tm)
    library(RWeka)
    txt <- read.csv("HW.csv", header = TRUE)
    df <- do.call("rbind", lapply(txt, as.data.frame))
    names(df) <- "text"
    myCorpus <- Corpus(VectorSource(df$text))
    myStopwords <- c(stopwords("english"), "originally", "posted")
    myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
    # building the TDM
    btm <- function(x) NGramTokenizer
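For the frequency list itself, a minimal sketch that picks up where the question's code leaves off (assumes myCorpus has already been cleaned as above):

    library(tm)

    tdm  <- TermDocumentMatrix(myCorpus)
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    word_freqs <- data.frame(word = names(freq), freq = freq, row.names = NULL)
    head(word_freqs, 10)  # the ten most frequent words

as.matrix is fine for modestly sized corpora; for very large ones, see the memory-management question below.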

Text Mining in R | memory management

Submitted by ◇◆丶佛笑我妖孽 on 2019-11-26 22:01:39
Question: I am using a 160 MB text file for data mining, but once I convert it to a matrix to get the word frequencies it demands far too much memory. Can someone please help me with this?

    > dtm <- DocumentTermMatrix(clean)
    > dtm
    <<DocumentTermMatrix (documents: 472029, terms: 171548)>>
    Non-/sparse entries: 3346670/80972284222
    Sparsity           : 100%
    Maximal term length: 126
    Weighting          : term frequency (tf)
    > as.matrix(dtm)
    Error: cannot allocate vector of size 603.3 Gb

Answer 1: @Vineet here is the
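The usual fix is to avoid as.matrix entirely and work with the sparse representation. A minimal sketch using the slam package, on which tm already depends (assumes dtm is the DocumentTermMatrix from the question):

    library(slam)  # tm stores the DTM as a slam simple_triplet_matrix

    freq <- sort(col_sums(dtm), decreasing = TRUE)  # term frequencies, no dense matrix
    head(freq, 20)

Dropping rare terms first with removeSparseTerms(dtm, 0.99) also shrinks the matrix dramatically.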

Text Mining R Package & Regex to Replace Smart Curly Quotes

Submitted by 前提是你 on 2019-11-26 21:55:45
Question: I've got a bunch of texts like the one below with different smart quotes, both single and double. All I could manage with the packages I'm aware of is to remove those characters, but I want them replaced with normal straight quotes. textclean::replace_non_ascii("You don‘t get “your” money’s worth") Received output: "You dont get your moneys worth" Expected output: "You don't get "your" money's worth" I would also appreciate a regex that replaces every such quote in one shot.
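A minimal sketch in base R that replaces rather than removes the curly quotes, using Unicode escapes so the pattern survives copy/paste (\u2018, \u2019, \u201A, \u201B are the single smart quotes; \u201C, \u201D, \u201E, \u201F the double ones):

    x <- "You don\u2018t get \u201Cyour\u201D money\u2019s worth"
    x <- gsub("[\u2018\u2019\u201A\u201B]", "'", x)   # single smart quotes -> '
    x <- gsub("[\u201C\u201D\u201E\u201F]", "\"", x)  # double smart quotes -> "
    x
    #> [1] "You don't get \"your\" money's worth"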

Emoticons in Twitter Sentiment Analysis in r

Submitted by 泪湿孤枕 on 2019-11-26 18:18:43
Question: How do I handle or get rid of emoticons so that I can sort tweets for sentiment analysis? I am getting: Error in sort.list(y) : invalid input Thanks. This is how the emoticons look when they come from Twitter into R: \xed��\xed�\u0083\xed��\xed�� \xed��\xed�\u008d\xed��\xed�\u0089 Answer 1: This should get rid of the emoticons, using iconv as suggested by ndoogan. Some reproducible data: require(twitteR) # note that I had to register my twitter credentials first # here's the method: http:/
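A minimal sketch of the iconv approach (assumes tweets is a character vector of raw tweet text; note that sub = "" also drops legitimate accented characters, not just emoji):

    # re-read the bytes as latin1, then keep only what survives in ASCII
    clean <- iconv(tweets, from = "latin1", to = "ASCII", sub = "")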

Text-mining with the tm-package - word stemming

Submitted by 安稳与你 on 2019-11-26 18:14:40
Question: I am doing some text mining in R with the tm package. Everything works very smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Obviously, there are some words which have the same stem, but it is important that they are not "thrown together" (as those words mean different things). For an example see the 4 texts below. Here you cannot use "lecturer" and "lecture" ("association" and "associate") interchangeably. However, this is what is done in step 4.
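A minimal sketch that reproduces the problem with the SnowballC stemmer (the one tm::stemDocument uses by default):

    library(SnowballC)

    wordStem(c("lecturer", "lecture", "association", "associate"))
    #> [1] "lectur" "lectur" "associ" "associ"

The stemmer itself cannot keep these apart; the usual workaround is to keep an unstemmed copy of the corpus and use tm::stemCompletion, or to exclude such terms from stemming via a custom dictionary.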

What is “entropy and information gain”?

Submitted by 一笑奈何 on 2019-11-26 18:02:20
Question: I am reading this book (NLTK) and it is confusing. Entropy is defined as: Entropy is the sum of the probability of each label times the log probability of that same label. How can I apply entropy and maximum entropy in terms of text mining? Can someone give me an easy, simple (visual) example? Answer 1 (Amro): I assume entropy was mentioned in the context of building decision trees. To illustrate, imagine the task of learning to classify first names into male/female groups. That is, given a list of names each labeled with either m or f, we want to learn a model that fits the data and can be used to
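A minimal worked example of the definition just quoted, H = -sum over labels of p(label) * log2 p(label):

    p <- c(m = 0.5, f = 0.5)  # evenly split labels
    -sum(p * log2(p))         # 1 bit: maximal uncertainty

    p <- c(m = 0.9, f = 0.1)  # skewed labels
    -sum(p * log2(p))         # ~0.47 bits: much more predictable

Information gain is then the drop in entropy after splitting the data on a feature (e.g. the name's last letter).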

Recognize PDF table using R

Submitted by 自闭症网瘾萝莉.ら on 2019-11-26 14:13:53
Question: I'm trying to extract data from tables inside some PDF reports. I've seen some examples using pdftools and similar packages, and I was successful in getting the text; however, I just want to extract the tables. Is there a way to use R to recognize and extract only tables? Answer 1: Awesome question, I wondered about the same thing recently, thanks! I did it with tabulizer ‘0.2.2’, as @hrbrmstr suggests too. If you are using R version 3.5.2, I'm providing the following solution. Install the three
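A minimal sketch of the tabulizer approach (requires a Java runtime; "report.pdf" is a placeholder path):

    library(tabulizer)

    tables <- extract_tables("report.pdf")  # one matrix per table tabulizer detects
    df <- as.data.frame(tables[[1]])        # first detected table as a data frame
    head(df)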