text-mining

Which NLP toolkit to use in Java? [closed]

做~自己de王妃 submitted on 2019-11-28 03:38:53
Question: I'm working on a project that consists of a website that connects to NCBI (National Center for Biotechnology Information) and searches for articles there. The thing is that I have to do some text mining on all the results. I'm using Java for the text mining and AJAX with ICEfaces for the development of the website. What I have: a list of articles returned from a search, where each article has an ID and an abstract. The idea is to get keywords from each abstract text. And then compare all

Text Mining in R | memory management

邮差的信 submitted on 2019-11-28 01:46:13
I am using a 160 MB text file for data mining, but it seems that once I convert it to a matrix to get the word frequencies, it demands far too much memory. Can someone please help me with this?

> dtm <- DocumentTermMatrix(clean)
> dtm
<<DocumentTermMatrix (documents: 472029, terms: 171548)>>
Non-/sparse entries: 3346670/80972284222
Sparsity: 100%
Maximal term length: 126
Weighting: term frequency (tf)
> as.matrix(dtm)
Error: cannot allocate vector of size 603.3 Gb

@Vineet here is the math that shows why R tried to allocate 603 Gb to convert the document-term matrix to a non-sparse matrix.
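The answer above is cut off before the arithmetic; a short sketch of that math, plus the usual way to get term frequencies without densifying the matrix (slam is the sparse-matrix package tm already uses to store the DTM; the dtm object is the one from the question):

# A dense numeric matrix stores 8 bytes per cell
documents <- 472029
terms     <- 171548
documents * terms * 8 / 1024^3   # ~603.3, i.e. the 603.3 Gb in the error

# Term frequencies can be computed on the sparse representation directly
library(slam)
term_freq <- sort(col_sums(dtm), decreasing = TRUE)
head(term_freq, 20)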

Text Mining R Package & Regex to Replace Smart Curly Quotes

瘦欲@ submitted on 2019-11-28 01:32:41
I've got a bunch of texts like the one below with different smart quotes, both single and double. All I could manage with the packages I'm aware of is to remove those characters, but I want them replaced with the normal (straight) quotes.

textclean::replace_non_ascii("You don‘t get “your” money’s worth")
Received output: "You dont get your moneys worth"
Expected output: "You don't get "your" money's worth"

I would also appreciate a regex that replaces every such quote in one shot. Thanks! Use two gsub operations: 1) to replace double curly quotes, 2) to replace single quotes: > gsub(
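The answer is truncated after the first gsub, but a minimal sketch of the two-gsub approach it describes looks like this (the Unicode escapes for the curly-quote code points are an assumption; the original answer may have spelled them differently):

x <- "You don‘t get “your” money’s worth"
x <- gsub("[\u201C\u201D]", '"', x)   # curly double quotes -> straight double quote
x <- gsub("[\u2018\u2019]", "'", x)   # curly single quotes -> straight apostrophe
x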

Extracting dates that are in different formats using regex and sorting them - pandas

随声附和 submitted on 2019-11-27 23:23:42
I am new to text mining and I need to extract the dates from a *.txt file and sort them. The dates appear within the sentences (one per line) and can potentially be in any of the following formats:

04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010

If the day is missing, assume the 1st; if the month is missing, assume January. My idea is to extract all dates and convert them into mm

Detect text language in R [closed]

血红的双手。 submitted on 2019-11-27 17:44:12
In R I have a list of tweets and I would like to keep only those that are in English. I am wondering if any of you know an R package that provides a simple way to identify the language of a string. Cheers, z

The textcat package does this. It can detect 74 'languages' (more properly, language/encoding combinations), and more with other extensions. Details and examples are in this freely available article: Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52, 1-17. Here's the
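A minimal sketch of how textcat is typically applied to a character vector and used to keep only the English entries (the tweets vector is a stand-in, not data from the original post; n-gram profiles can misfire on very short strings):

library(textcat)
tweets <- c("keep this English tweet",
            "ceci est un tweet en français",
            "dies ist ein deutscher Tweet")
langs <- textcat(tweets)                 # one language guess per string
english_only <- tweets[langs == "english"]
english_only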

Math of tm::findAssocs: how does this function work?

半腔热情 submitted on 2019-11-27 17:14:15
Question: I have been using findAssocs() with text mining (the tm package) but realized that something doesn't seem right with my dataset. My dataset is 1500 open-ended answers saved in one column of a CSV file. So I read the dataset in like this and used the typical tm_map calls to turn it into a corpus:

library(tm)
Q29 <- read.csv("favoritegame2.csv")
corpus <- Corpus(VectorSource(Q29$Q29))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map
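The excerpt cuts off before a term-document matrix is built, but a minimal sketch of how findAssocs() is usually called on such a corpus looks like this (the term "play" and the 0.2 threshold are illustrative assumptions, not values from the original post):

# Continue from the corpus built above
tdm <- TermDocumentMatrix(corpus)
# Report terms whose per-document frequencies correlate with "play"
# at a correlation of at least 0.2
findAssocs(tdm, "play", 0.2)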

Emoticons in Twitter Sentiment Analysis in R

吃可爱长大的小学妹 submitted on 2019-11-27 13:38:21
How do I handle or get rid of emoticons so that I can sort tweets for sentiment analysis? I am getting: Error in sort.list(y) : invalid input. Thanks. This is how the emoticons come out looking when they arrive from Twitter into R: \xed��\xed�\u0083\xed��\xed�� \xed��\xed�\u008d\xed��\xed�\u0089

This should get rid of the emoticons, using iconv as suggested by ndoogan. Some reproducible data:

require(twitteR)
# note that I had to register my twitter credentials first
# here's the method: http://stackoverflow.com/q/9916283/1036500
s <- searchTwitter('#emoticons', cainfo="cacert.pem")
# convert to data frame
df <-
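The answer is cut off before the actual cleaning step; a minimal sketch of the iconv approach it mentions, applied to a small stand-in vector (the source encoding to pass as from depends on how the tweets were read in, so "UTF-8" here is an assumption):

tweet_text <- c("great game \U0001F600", "no emoticons here")
# Convert to ASCII and drop anything that will not convert (sub = ""),
# which removes emoji/emoticon code points before sorting or scoring
clean_text <- iconv(tweet_text, from = "UTF-8", to = "ASCII", sub = "")
clean_text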

Text-mining with the tm-package - word stemming

梦想与她 submitted on 2019-11-27 12:30:42
I am doing some text mining in R with the tm package. Everything works very smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Obviously, there are some words which have the same stem, but it is important that they are not "thrown together" (as those words mean different things). For an example, see the 4 texts below. Here you cannot use "lecturer" and "lecture" ("association" and "associate") interchangeably. However, this is what is done in step 4. Is there any elegant solution for implementing this manually for some cases/words (e.g. that
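The question is truncated before its example texts, but one common workaround is to "protect" the problem words by tagging them before stemming and stripping the tag afterwards. A minimal sketch under that assumption (the word pair and the "xx" tag are illustrative choices, not from the original post; the tag is picked so the Porter stemmer leaves the word untouched):

library(tm)
docs <- c("the lecturer gave a lecture",
          "the association will associate the members")
corpus <- VCorpus(VectorSource(docs))

# Tag words that must survive stemming, stem, then remove the tag again
protect <- content_transformer(function(x)
  gsub("\\b(lecturer|association)\\b", "\\1xx", x))
restore <- content_transformer(function(x)
  gsub("(lecturer|association)xx", "\\1", x))

corpus <- tm_map(corpus, protect)
corpus <- tm_map(corpus, stemDocument)   # "lecture" -> "lectur"; tagged words untouched
corpus <- tm_map(corpus, restore)
lapply(corpus, as.character)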

List of word frequencies using R

匆匆过客 submitted on 2019-11-27 12:09:50
I have been using the tm package to run some text analysis. My problem is with creating a list of words and the frequencies associated with them.

library(tm)
library(RWeka)
txt <- read.csv("HW.csv", header = T)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"
myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords('english'), "originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
# building the TDM
btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm)
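The excerpt stops mid-call, but once the term-document matrix exists, a frequency table can be pulled from it without converting it to a dense matrix. A minimal sketch, assuming myTdm was built as above (slam is the sparse-matrix package tm depends on):

library(slam)
# Rows of a TermDocumentMatrix are terms, so summing across rows
# gives the total frequency of each trigram over all documents
freq <- sort(row_sums(myTdm), decreasing = TRUE)
freq_df <- data.frame(ngram = names(freq), freq = freq, row.names = NULL)
head(freq_df, 10)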

Really fast word ngram vectorization in R

拟墨画扇 submitted on 2019-11-27 11:19:28
Question (edit): The new package text2vec is excellent, and solves this problem (and many others) really well. See text2vec on CRAN, text2vec on GitHub, and the vignette that illustrates ngram tokenization. I have a pretty large text dataset in R, which I've imported as a character vector:

# Takes about 15 seconds
system.time({
set.seed(1)
samplefun <- function(n, x, collapse){
  paste(sample(x, n, replace=TRUE), collapse=collapse)
}
words <- sapply(rpois(10000, 3) + 1, samplefun, letters, '')
sents1 <- sapply(rpois
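The question is truncated before the vectorization step, but a minimal sketch of the text2vec route recommended in the edit, run on a tiny stand-in corpus, looks like this (the object names and the unigram-plus-bigram setting are illustrative assumptions):

library(text2vec)
sents <- c("really fast word ngram vectorization",
           "fast ngram vectorization in R")
it    <- itoken(sents, tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it, ngram = c(1L, 2L))   # unigrams and bigrams
dtm   <- create_dtm(it, vocab_vectorizer(vocab))    # sparse document-term matrix
dim(dtm)
colnames(dtm)[1:5]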