tm | 易学教程

R Text mining - how to change texts in R data frame column into several columns with word frequencies?

I have a data frame with 4 columns. Column 1 consists of IDs, column 2 consists of texts (about 100 words each), and columns 3 and 4 contain labels. Now I would like to retrieve word frequencies (of the most common words) from the texts column and add those frequencies as extra columns to the data frame. I would like the column names to be the words themselves and the columns filled with their frequencies (ranging from 0 to ... per text) in the texts. I tried some functions of the tm package, but so far without satisfactory results. Does anyone have any idea how to deal with this problem or where to start? Is

Programmatically look up a ticker symbol in R

Question: I have a field of data containing company names, such as

    company <- c("Microsoft", "Apple", "Cloudera", "Ford")
    > company
      Company
    1 Microsoft
    2 Apple
    3 Cloudera
    4 Ford

and so on. The package tm.plugin.webmining allows you to query data from Yahoo! Finance based on ticker symbols:

    require(tm.plugin.webmining)
    results <- WebCorpus(YahooFinanceSource("MSFT"))

I'm missing the in-between step. How can I query ticker symbols programmatically based on company names?

Answer 1: I couldn't manage to do this

Big Text Corpus breaks tm_map

I have been racking my brain over this one for the last few days. I searched all the SO archives and tried the suggested solutions, but I just can't seem to get this to work. I have sets of txt documents in folders such as 2000 06, 1995 -99 etc., and I want to run some basic text mining operations, such as creating a document-term matrix and a term-document matrix and doing some operations based on co-locations of words. My script works on a smaller corpus; however, when I try it with the bigger corpus, it fails. I have pasted in the code for one such folder operation.

    library(tm) # Framework for text

How to extract sentences containing specific person names using R

I am using R to extract sentences containing specific person names from texts and here is a sample paragraph: Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority

Error installing old packages in R

Question: I'm trying to install version 0.6-2 of the tm library. I've downloaded the tar.gz file from the archive and, in RStudio, selected Tools -> Archive -> Package Archive File to install it. However, I'm getting the following error. Can someone please help me fix this?

    installing source package 'tm' ...
    ** package 'tm' successfully unpacked and MD5 sums checked
    ** libs
    * arch - i386
    c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-33~1.2/include" -DNDEBUG -I"d:/Compiler/gcc-4.9.3/local330/include"

Example of tm use

Can you give an example of the use of tm (I don't know how to initialize that struct) where the current date is written in the format y/m/d?

How to use the tm structure:

1. Call time() to get the current date/time as the number of seconds since 1 Jan 1970.
2. Call localtime() to get a struct tm pointer. If you want GMT, then call gmtime() instead of localtime().
3. Use sprintf() or strftime() to convert the struct tm to a string in any format you want.

Example:

    #include <stdio.h>
    #include <time.h>

    int main ()
    {
      time_t rawtime;
      struct tm * timeinfo;
      char buffer [80];

      time ( &rawtime );
      timeinfo = localtime ( &rawtime );

Support Vector Machine works on Training-set but not on Test-set in R (using e1071)

I'm using a support vector machine for my document classification task. It classifies all the articles in my training set, but fails to classify the ones in my test set. trainDTM is the document-term matrix of my training set; testDTM is the one for the test set. Here's my (not so beautiful) code:

    # create data.frame with labelled sentences
    labeled <- as.data.frame(read.xlsx("C:\\Users\\LABELED.xlsx", 1, header=T))
    # create training set and test set
    traindata <- as.data.frame(labeled[1:700,c("ARTICLE","CLASS")])
    testdata <- as.data.frame(labeled[701:1000, c("ARTICLE","CLASS")])
    # Vector,

tm loses the metadata when applying tm_map

I have a (small) problem with the tm R library. Say I have a corpus:

    # boilerplate
    bcorp <- c("one","two","three","four","five")
    myCorpus <- Corpus(VectorSource(bcorp), list(lanuage = "en_US"))
    tdm <- TermDocumentMatrix(myCorpus)
    Docs(tdm)

Result:

    [1] "1" "2" "3" "4" "5"

This works. But when I try to use a transformation with tm_map():

    # this does not work
    myCorpus <- Corpus(VectorSource(bcorp), list(lanuage = "en_US"))
    myCorpus <- tm_map(myCorpus, tolower)
    tdm <- TermDocumentMatrix(myCorpus)

it gives the error:

    inherits(doc, "TextDocument") is not TRUE

The solution proposed in this case was to transform

How to write custom removePunctuation() function to better deal with Unicode chars?

Question: In the source code of the tm text-mining R package, in the file transform.R, there is the removePunctuation() function, currently defined as:

    function(x, preserve_intra_word_dashes = FALSE) {
        if (!preserve_intra_word_dashes)
            gsub("[[:punct:]]+", "", x)
        else {
            # Assume there are no ASCII 1 characters.
            x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
            x <- gsub("[[:punct:]]+", "", x)
            gsub("\1", "-", x, fixed = TRUE)
        }
    }

I need to parse and mine some abstracts from a science conference (fetched from their

remove duplicates from list based on semantic similarity/relatedness

Question: R + tm: How do I de-duplicate items in a list, based on semantic similarity? v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv"). My expected solution would be c("bank", "ford_suv", "toyota_suv", "nissan_suv"). That is, bank, banks and banking should be reduced to the single term "bank". SnowBall::stemming is not an option because I have to retain the flavor of the newspaper styles of various countries. Any help or direction will be useful.

Answer 1: We could calculate the Levenshtein distance