text-mining

How to create a good NER training model in OpenNLP?

Submitted by 一笑奈何 on 2019-11-27 01:56:46
Question: I have just started with OpenNLP. I need to create a simple training model to recognize named entities. Reading the docs here https://opennlp.apache.org/docs/1.8.0/apidocs/opennlp-tools/opennlp/tools/namefind I see this simple text for training the model: <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 . Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group . <START:person> Rudolph Agnew <END> , 55 years
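For reference, the name finder's training format is one sentence per line, with tokens separated by whitespace and each entity wrapped in <START:type> ... <END> tags. A minimal sketch of such a file (sentences shortened for illustration):

    <START:person> Pierre Vinken <END> , 61 years old , will join the board .
    Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. .

The OpenNLP manual recommends training on a large corpus (at least around 15,000 sentences) to get a model that performs well; a handful of examples will not produce a good model.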

Finding 2 & 3 word Phrases Using R TM Package

Submitted by 风流意气都作罢 on 2019-11-27 00:10:17
Question: I am trying to find code that actually works to find the most frequently used two- and three-word phrases with the R text mining package tm (maybe there is another package for it that I do not know). I have been trying to use the tokenizer, but seem to have no luck. If you have worked on a similar situation in the past, could you post code that is tested and actually works? Thank you so much! Answer: You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the package tau installed it's fairly straightforward. library(tm); library(tau); tokenize_ngrams <- function(x, n=3)
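A fuller sketch of that approach, assuming the tm and tau packages are installed (the sample corpus and the lowfreq value are invented for illustration):

    library(tm)
    library(tau)

    # tokenize each document into word n-grams via tau::textcnt (method "string")
    tokenize_ngrams <- function(x, n = 3) {
      rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n))))
    }

    corpus <- Corpus(VectorSource(c("the quick brown fox", "the quick red fox")))
    dtm <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))
    findFreqTerms(dtm, lowfreq = 2)  # phrases that appear in at least 2 documents

An alternative worth knowing: RWeka::NGramTokenizer with Weka_control(min = 2, max = 3) can be passed the same way and yields exactly the two- and three-word phrases.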

Detect text language in R [closed]

Submitted by 杀马特。学长 韩版系。学妹 on 2019-11-26 22:36:04
Question: In R I have a list of tweets and I would like to keep only those that are in English. I am wondering if any of you know of an R package that provides a simple way to identify the language of a string. Cheers, z Answer 1: The textcat package does this. It can detect 74 'languages' (more properly, language/encoding combinations), and more with other extensions. Details and examples are in this freely available article: Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. The textcat
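A minimal sketch of the textcat approach (the sample strings are invented; note that very short texts such as tweets are easy to misclassify):

    library(textcat)

    tweets <- c("Hello world, how are you?",
                "Bonjour tout le monde",
                "Hola a todos")
    langs <- textcat(tweets)      # e.g. "english", "french", "spanish"
    tweets[langs == "english"]   # keep only the English tweets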

list of word frequencies using R

Submitted by 旧巷老猫 on 2019-11-26 22:21:50
Question: I have been using the tm package to run some text analysis. My problem is with creating a list of words and their associated frequencies:

    library(tm)
    library(RWeka)
    txt <- read.csv("HW.csv", header = TRUE)
    df <- do.call("rbind", lapply(txt, as.data.frame))
    names(df) <- "text"
    myCorpus <- Corpus(VectorSource(df$text))
    myStopwords <- c(stopwords("english"), "originally", "posted")
    myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
    # building the TDM
    btm <- function(x) NGramTokenizer
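For the frequency list itself, a minimal sketch that picks up where the question's code leaves off (assumes myCorpus has already been cleaned as above):

    library(tm)

    tdm  <- TermDocumentMatrix(myCorpus)
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    word_freqs <- data.frame(word = names(freq), freq = freq, row.names = NULL)
    head(word_freqs, 10)  # the ten most frequent words

as.matrix is fine for modestly sized corpora; for very large ones, see the memory-management question below.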

Text Mining in R | memory management

Submitted by ◇◆丶佛笑我妖孽 on 2019-11-26 22:01:39
Question: I am using a 160 MB text file for data mining, but once I convert it to a matrix to get the word frequencies it demands far too much memory. Can someone please help me with this?

    > dtm <- DocumentTermMatrix(clean)
    > dtm
    <<DocumentTermMatrix (documents: 472029, terms: 171548)>>
    Non-/sparse entries: 3346670/80972284222
    Sparsity           : 100%
    Maximal term length: 126
    Weighting          : term frequency (tf)
    > as.matrix(dtm)
    Error: cannot allocate vector of size 603.3 Gb

Answer 1: @Vineet here is the
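The usual fix is to avoid as.matrix entirely and work with the sparse representation. A minimal sketch using the slam package, on which tm already depends (assumes dtm is the DocumentTermMatrix from the question):

    library(slam)  # tm stores the DTM as a slam simple_triplet_matrix

    freq <- sort(col_sums(dtm), decreasing = TRUE)  # term frequencies, no dense matrix
    head(freq, 20)

Dropping rare terms first with removeSparseTerms(dtm, 0.99) also shrinks the matrix dramatically.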

Text Mining R Package & Regex to Replace Smart Curly Quotes

Submitted by 前提是你 on 2019-11-26 21:55:45
Question: I've got a bunch of texts like the one below with different smart quotes, both single and double. All I could manage with the packages I'm aware of is to remove those characters, but I want them replaced with normal straight quotes. textclean::replace_non_ascii("You don‘t get “your” money’s worth") Received output: "You dont get your moneys worth" Expected output: "You don't get "your" money's worth" I would also appreciate a regex that replaces every such quote in one shot.
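A minimal sketch in base R that replaces rather than removes the curly quotes, using Unicode escapes so the pattern survives copy/paste (\u2018, \u2019, \u201A, \u201B are the single smart quotes; \u201C, \u201D, \u201E, \u201F the double ones):

    x <- "You don\u2018t get \u201Cyour\u201D money\u2019s worth"
    x <- gsub("[\u2018\u2019\u201A\u201B]", "'", x)   # single smart quotes -> '
    x <- gsub("[\u201C\u201D\u201E\u201F]", "\"", x)  # double smart quotes -> "
    x
    #> [1] "You don't get \"your\" money's worth"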

Emoticons in Twitter Sentiment Analysis in r

Submitted by 泪湿孤枕 on 2019-11-26 18:18:43
Question: How do I handle or get rid of emoticons so that I can sort tweets for sentiment analysis? I am getting: Error in sort.list(y) : invalid input Thanks. This is how the emoticons look when they come from Twitter into R: \xed��\xed�\u0083\xed��\xed�� \xed��\xed�\u008d\xed��\xed�\u0089 Answer 1: This should get rid of the emoticons, using iconv as suggested by ndoogan. Some reproducible data: require(twitteR) # note that I had to register my twitter credentials first # here's the method: http:/
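A minimal sketch of the iconv approach (assumes tweets is a character vector of raw tweet text; note that sub = "" also drops legitimate accented characters, not just emoji):

    # re-read the bytes as latin1, then keep only what survives in ASCII
    clean <- iconv(tweets, from = "latin1", to = "ASCII", sub = "")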

Text-mining with the tm-package - word stemming

Submitted by 安稳与你 on 2019-11-26 18:14:40
Question: I am doing some text mining in R with the tm package. Everything works very smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Obviously, there are some words which have the same stem, but it is important that they are not "thrown together" (as those words mean different things). For an example see the 4 texts below. Here you cannot use "lecturer" and "lecture" ("association" and "associate") interchangeably. However, this is what is done in step 4.
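A minimal sketch that reproduces the problem with the SnowballC stemmer (the one tm::stemDocument uses by default):

    library(SnowballC)

    wordStem(c("lecturer", "lecture", "association", "associate"))
    #> [1] "lectur" "lectur" "associ" "associ"

The stemmer itself cannot keep these apart; the usual workaround is to keep an unstemmed copy of the corpus and use tm::stemCompletion, or to exclude such terms from stemming via a custom dictionary.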

What is “entropy and information gain”?

Submitted by 一笑奈何 on 2019-11-26 18:02:20
Question: I am reading this book (NLTK) and it is confusing. Entropy is defined as: Entropy is the sum of the probability of each label times the log probability of that same label. How can I apply entropy and maximum entropy in terms of text mining? Can someone give me an easy, simple (visual) example? Answer 1 (Amro): I assume entropy was mentioned in the context of building decision trees. To illustrate, imagine the task of learning to classify first names into male/female groups. That is, given a list of names each labeled with either m or f, we want to learn a model that fits the data and can be used to
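A minimal worked example of the definition just quoted, H = -sum over labels of p(label) * log2 p(label):

    p <- c(m = 0.5, f = 0.5)  # evenly split labels
    -sum(p * log2(p))         # 1 bit: maximal uncertainty

    p <- c(m = 0.9, f = 0.1)  # skewed labels
    -sum(p * log2(p))         # ~0.47 bits: much more predictable

Information gain is then the drop in entropy after splitting the data on a feature (e.g. the name's last letter).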

Recognize PDF table using R

Submitted by 自闭症网瘾萝莉.ら on 2019-11-26 14:13:53
Question: I'm trying to extract data from tables inside some PDF reports. I've seen some examples using pdftools and similar packages, and I was successful in getting the text; however, I just want to extract the tables. Is there a way to use R to recognize and extract only tables? Answer 1: Awesome question, I wondered about the same thing recently, thanks! I did it with tabulizer ‘0.2.2’, as @hrbrmstr suggests too. If you are using R version 3.5.2, I'm providing the following solution. Install the three
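A minimal sketch of the tabulizer approach (requires a Java runtime; "report.pdf" is a placeholder path):

    library(tabulizer)

    tables <- extract_tables("report.pdf")  # one matrix per table tabulizer detects
    df <- as.data.frame(tables[[1]])        # first detected table as a data frame
    head(df)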