tm

Form bigrams without stopwords in R

末鹿安然 submitted on 2019-12-24 01:59:15
Question: I have been having some trouble with bigrams in text mining with R recently. The purpose is to find meaningful keywords in news text, for example "smart car" and "data mining". Let's say I have a string as follows: "IBM have a great success in the computer industry for the past decades..." After removing stopwords ("have", "a", "in", "the", "for"), it becomes "IBM great success computer industry past decades..." As a result, bigrams like "success computer" or "industry past" will occur. But what I really need is
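Since the excerpt cuts off mid-sentence, the usual reading of this question is that bigrams should only be formed from words that were adjacent in the original text, discarding any pair that contains a stopword. A minimal sketch of that approach with base R and tm (variable names are illustrative):

    library(tm)

    txt <- "IBM have a great success in the computer industry for the past decades"

    # Tokenize the original text, preserving the true word order
    words <- scan(text = tolower(txt), what = "character", quiet = TRUE)

    # Form bigrams from originally adjacent words only ...
    left    <- head(words, -1)
    right   <- tail(words, -1)
    bigrams <- paste(left, right)

    # ... then drop any bigram containing a stopword, so artefacts
    # like "success computer" or "industry past" never arise
    sw <- stopwords("en")
    bigrams[!(left %in% sw | right %in% sw)]
    # "great success" "computer industry" "past decades"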

dict function for ngrams

戏子无情 submitted on 2019-12-24 01:56:05
Question: I have this kind of text: library(dplyr) glimpse(text) chr [1:11] "Welcome to Wikipedia ! [bla] Discover Ekopedia, the practical encyclopedia about alternative life techniques. \"| __truncated__ ... and this kind of bi_grams: glimpse(dict) chr [1:34] "and i" "and the" "as a" "at the" "do not" "for the" "from the" "has been" "i am" "i dont" ... My goal is to build a DocumentTermMatrix from text using the bi_grams of dict. To achieve this I preprocessed text. library(tm) corpus <- VCorpus
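The question is truncated, but the standard tm recipe for this task is a custom bigram tokenizer (the one from the tm FAQ, built on NLP::ngrams) combined with the dictionary control option of DocumentTermMatrix(). A sketch with toy stand-ins for text and dict:

    library(tm)
    library(NLP)

    text <- c("i am not sure about this", "it has been a long day")  # illustrative
    dict <- c("i am", "has been", "do not")                          # illustrative

    # Tokenizer that emits bigrams instead of single words
    BigramTokenizer <- function(x)
      unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
             use.names = FALSE)

    corpus <- VCorpus(VectorSource(text))
    dtm <- DocumentTermMatrix(corpus,
                              control = list(tokenize   = BigramTokenizer,
                                             dictionary = dict))
    inspect(dtm)  # columns are restricted to the bigrams listed in dict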

Make all words uppercase in Wordcloud in R

老子叫甜甜 submitted on 2019-12-23 21:22:11
Question: When creating wordclouds it is most common to make all the words lowercase. However, I want the wordcloud to display the words in uppercase. After forcing the words to uppercase, the wordcloud still displays lowercase words. Any ideas why? Reproducible code: library(tm) library(wordcloud) data <- data.frame(text = c("Creativity is the art of being ‘productive’ by using the available resources in a skillful manner. Scientifically speaking, creativity is part of our consciousness and we can be
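Although the excerpt is cut off, a likely culprit is that a later step lowercases the terms again: when wordcloud() is handed raw text rather than words plus frequencies, it runs its own tm preprocessing, which includes tolower. One hedged workaround is to compute the frequencies yourself and uppercase the names at the last moment (the text below abbreviates the question's data):

    library(tm)
    library(wordcloud)

    text <- "Creativity is the art of being productive by using the available resources"

    # Build the term frequencies explicitly so no later step can lowercase them
    tdm  <- TermDocumentMatrix(VCorpus(VectorSource(text)),
                               control = list(removePunctuation = TRUE,
                                              stopwords = TRUE))
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

    # Uppercase only when the words go to the plot
    wordcloud(words = toupper(names(freq)), freq = freq, min.freq = 1)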

classifying identically pattern in words using R

与世无争的帅哥 submitted on 2019-12-23 17:57:59
Question: I want to conduct a text mining analysis, but I have run into some trouble. Using dput(), I load a small part of my text. text<-structure(list(ID_C_REGCODES_CASH_VOUCHER = c(3941L, 3941L, 3941L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3953L, 3953L, 3953L, 3953L, 3953L, 3953L, 3960L, 3960L, 3960L, 3960L, 3960L, 3960L, 3967L, 3967L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), GOODS_NAME =

Print first line of one element of Corpus in R using tm package

我的梦境 submitted on 2019-12-23 02:52:44
Question: How do you print a small sample, or the first line, of a corpus in R using the tm package? I have a very large corpus (> 1 GB) and am doing some text cleaning. I would like to test as I apply each cleaning procedure. Printing just the first line, or the first few lines, of the corpus would be ideal. # Load Libraries library(tm) # Read in Corpus corp <- SimpleCorpus( DirSource( "C:/TextDocument")) # Remove punctuation corp <- removePunctuation(corp, preserve_intra_word_contractions = TRUE, preserve_intra
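A lightweight way to test each cleaning step, assuming plain-text documents (the directory path is the one from the question):

    library(tm)

    corp <- SimpleCorpus(DirSource("C:/TextDocument"))

    # as.character() extracts the text from both SimpleCorpus and VCorpus documents
    first_doc <- as.character(corp[[1]])

    # First line only ...
    writeLines(strsplit(first_doc, "\n", fixed = TRUE)[[1]][1])

    # ... or just the first 200 characters, which also covers one-line documents
    writeLines(substr(first_doc, 1, 200))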

R: tm Textmining package: Doc-Level metadata generation is slow

試著忘記壹切 submitted on 2019-12-22 20:43:34
Question: I have a list of documents to process, and for each record I want to attach some metadata to the document "member" inside the "corpus" data structure that tm, the R package, generates (from reading in text files). This for-loop works, but it is very slow; performance seems to degrade as a function f ~ 1/n_docs. for (i in seq(from= 1, to=length(corpus), by=1)){ if(opts$options$verbose == TRUE || i %% 50 == 0){ print(paste(i, " ", substr(corpus[[i]], 1, 140), sep = " ")) } DublinCore(corpus[[i]
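The loop is truncated, but the usual explanation for this pattern is that each corpus[[i]] <- ... assignment rewrites the corpus object, so the cost grows with every iteration. A sketch of a workaround that does the per-document work over a plain list and rebuilds the corpus once (toy data; "title" stands in for the real Dublin Core fields):

    library(tm)

    corpus <- VCorpus(VectorSource(c("first doc", "second doc", "third doc")))
    titles <- c("Doc A", "Doc B", "Doc C")  # illustrative metadata

    # Modify copies of the documents in a plain list, then rebuild in one step
    docs <- lapply(seq_along(corpus), function(i) {
      doc <- corpus[[i]]
      meta(doc, "title") <- titles[i]
      doc
    })
    corpus <- as.VCorpus(docs)

    # Corpus-level ("indexed") metadata can be assigned in one vectorized call
    meta(corpus, "title") <- titles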

How to convert corpus to data.frame with meta data in R

爱⌒轻易说出口 submitted on 2019-12-22 10:13:44
Question: How can I convert a corpus into a data frame in R that also contains the metadata? I already tried the suggestion from "convert corpus into data.frame in R", but the resulting data frame only contains the text lines from all docs in the corpus. I also need the document ID, and perhaps the line number of each text line, as two additional columns. So, how can I extend this command: dataframe <- data.frame(text = unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors = FALSE) to get that data? I already tried
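One way to extend that command and keep the document ID plus a per-line number, sketched against a toy VCorpus of plain-text documents:

    library(tm)

    mycorpus <- VCorpus(VectorSource(c("first doc", "second doc")))  # illustrative

    # One data-frame fragment per document: its ID, line numbers, and text lines
    rows <- lapply(mycorpus, function(doc) {
      lines <- as.character(doc)
      data.frame(id   = meta(doc, "id"),
                 line = seq_along(lines),
                 text = lines,
                 stringsAsFactors = FALSE)
    })
    dataframe <- do.call(rbind, rows)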

tm package error “Cannot convert DocumentTermMatrix into normal matrix since vector is too large”

泄露秘密 submitted on 2019-12-22 08:01:50
Question: I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB. > corp A corpus with 1859 text documents > mat<-DocumentTermMatrix(corp) > dim(mat) [1] 1859 25722 > is(mat) [1] "DocumentTermMatrix" > mat2<-as.matrix(mat) Fehler: kann Vektor der
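Rather than forcing a dense matrix through as.matrix(), a common workaround is to stay sparse: a DocumentTermMatrix is a slam simple triplet matrix underneath, and its slots map directly onto Matrix::sparseMatrix(). A sketch, where mat is the DocumentTermMatrix from the question:

    library(Matrix)

    # i/j/v hold the row index, column index, and count of each nonzero entry
    sparse <- sparseMatrix(i = mat$i, j = mat$j, x = mat$v,
                           dims = dim(mat), dimnames = dimnames(mat))

    # Many computations work on the sparse form without ever going dense
    col_totals <- colSums(sparse)

    # Alternatively, shrink the matrix first by dropping very sparse terms
    mat_small <- tm::removeSparseTerms(mat, 0.99)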

Compute ngrams for each row of text data in R

不羁岁月 submitted on 2019-12-21 21:43:24
Question: I have a data column of the following format: Text Hello world Hello How are you today I love stackoverflow blah blah blahdy I would like to compute the 3-grams for each row in this dataset, perhaps using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the n-grams for the entire column. How can I apply this function to each observation in my data separately? Answer 1: Is this what you're after? library("RWeka") library("tm") TrigramTokenizer <-
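The answer is cut off mid-definition, but since the question mentions tau::textcnt(), one per-row approach is simply to apply it over the rows. The row boundaries below are guessed from the flattened excerpt and are illustrative only; method = "string" counts word n-grams:

    library(tau)

    df <- data.frame(Text = c("Hello world",
                              "Hello How are you today",
                              "I love stackoverflow",
                              "blah blah blahdy"),
                     stringsAsFactors = FALSE)

    # textcnt() pools whatever it is given, so call it once per row;
    # each element of the result is named by the word 3-grams of that row
    trigrams <- lapply(df$Text, textcnt, n = 3L, method = "string")
    trigrams[[2]]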