tm

Form bigrams without stopwords in R

末鹿安然 submitted on 2019-12-24 01:59:15
Question: I have been having some trouble with bigrams in text mining with R recently. The purpose is to find meaningful keywords in news text, for example "smart car" and "data mining". Let's say I have a string as follows: "IBM have a great success in the computer industry for the past decades..." After removing stopwords ("have", "a", "in", "the", "for"), it becomes "IBM great success computer industry past decades..." As a result, bigrams like "success computer" or "industry past" will occur. But what I really need is
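Since the excerpt cuts off mid-sentence, the usual reading of this question is that bigrams should only be formed from words that were adjacent in the original text, discarding any pair that contains a stopword. A minimal sketch of that approach with base R and tm (variable names are illustrative):

    library(tm)

    txt <- "IBM have a great success in the computer industry for the past decades"

    # Tokenize the original text, preserving the true word order
    words <- scan(text = tolower(txt), what = "character", quiet = TRUE)

    # Form bigrams from originally adjacent words only ...
    left    <- head(words, -1)
    right   <- tail(words, -1)
    bigrams <- paste(left, right)

    # ... then drop any bigram containing a stopword, so artefacts
    # like "success computer" or "industry past" never arise
    sw <- stopwords("en")
    bigrams[!(left %in% sw | right %in% sw)]
    # "great success" "computer industry" "past decades"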

dict function for ngrams

戏子无情 submitted on 2019-12-24 01:56:05
Question: I have this kind of text: library(dplyr) glimpse(text) chr [1:11] "Welcome to Wikipedia ! [bla] Discover Ekopedia, the practical encyclopedia about alternative life techniques. \"| __truncated__ ... and this kind of bi_grams: glimpse(dict) chr [1:34] "and i" "and the" "as a" "at the" "do not" "for the" "from the" "has been" "i am" "i dont" ... My goal is to build a DocumentTermMatrix from text using the bi_grams of dict. To achieve this I preprocessed text. library(tm) corpus <- VCorpus
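The question is truncated, but the standard tm recipe for this task is a custom bigram tokenizer (the one from the tm FAQ, built on NLP::ngrams) combined with the dictionary control option of DocumentTermMatrix(). A sketch with toy stand-ins for text and dict:

    library(tm)
    library(NLP)

    text <- c("i am not sure about this", "it has been a long day")  # illustrative
    dict <- c("i am", "has been", "do not")                          # illustrative

    # Tokenizer that emits bigrams instead of single words
    BigramTokenizer <- function(x)
      unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
             use.names = FALSE)

    corpus <- VCorpus(VectorSource(text))
    dtm <- DocumentTermMatrix(corpus,
                              control = list(tokenize   = BigramTokenizer,
                                             dictionary = dict))
    inspect(dtm)  # columns are restricted to the bigrams listed in dict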

Make all words uppercase in Wordcloud in R

老子叫甜甜 submitted on 2019-12-23 21:22:11
Question: When creating wordclouds it is most common to make all the words lowercase. However, I want the wordcloud to display the words in uppercase. After forcing the words to uppercase, the wordcloud still displays lowercase words. Any ideas why? Reproducible code: library(tm) library(wordcloud) data <- data.frame(text = c("Creativity is the art of being ‘productive’ by using the available resources in a skillful manner. Scientifically speaking, creativity is part of our consciousness and we can be
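Although the excerpt is cut off, a likely culprit is that a later step lowercases the terms again: when wordcloud() is handed raw text rather than words plus frequencies, it runs its own tm preprocessing, which includes tolower. One hedged workaround is to compute the frequencies yourself and uppercase the names at the last moment (the text below abbreviates the question's data):

    library(tm)
    library(wordcloud)

    text <- "Creativity is the art of being productive by using the available resources"

    # Build the term frequencies explicitly so no later step can lowercase them
    tdm  <- TermDocumentMatrix(VCorpus(VectorSource(text)),
                               control = list(removePunctuation = TRUE,
                                              stopwords = TRUE))
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

    # Uppercase only when the words go to the plot
    wordcloud(words = toupper(names(freq)), freq = freq, min.freq = 1)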

classifying identically pattern in words using R

与世无争的帅哥 submitted on 2019-12-23 17:57:59
Question: I want to conduct a text mining analysis, but I have run into some trouble. Using dput(), I load a small part of my text. text<-structure(list(ID_C_REGCODES_CASH_VOUCHER = c(3941L, 3941L, 3941L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3953L, 3953L, 3953L, 3953L, 3953L, 3953L, 3960L, 3960L, 3960L, 3960L, 3960L, 3960L, 3967L, 3967L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), GOODS_NAME =

Print first line of one element of Corpus in R using tm package

我的梦境 submitted on 2019-12-23 02:52:44
Question: How do you print a small sample, or the first line, of a corpus in R using the tm package? I have a very large corpus (> 1 GB) and am doing some text cleaning. I would like to test as I apply each cleaning procedure. Printing just the first line, or the first few lines, of the corpus would be ideal. # Load Libraries library(tm) # Read in Corpus corp <- SimpleCorpus( DirSource( "C:/TextDocument")) # Remove punctuation corp <- removePunctuation(corp, preserve_intra_word_contractions = TRUE, preserve_intra
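A lightweight way to test each cleaning step, assuming plain-text documents (the directory path is the one from the question):

    library(tm)

    corp <- SimpleCorpus(DirSource("C:/TextDocument"))

    # as.character() extracts the text from both SimpleCorpus and VCorpus documents
    first_doc <- as.character(corp[[1]])

    # First line only ...
    writeLines(strsplit(first_doc, "\n", fixed = TRUE)[[1]][1])

    # ... or just the first 200 characters, which also covers one-line documents
    writeLines(substr(first_doc, 1, 200))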

R: tm Textmining package: Doc-Level metadata generation is slow

試著忘記壹切 submitted on 2019-12-22 20:43:34
Question: I have a list of documents to process, and for each record I want to attach some metadata to the document "member" inside the "corpus" data structure that tm, the R package, generates (from reading in text files). This for-loop works, but it is very slow; performance seems to degrade as a function f ~ 1/n_docs. for (i in seq(from= 1, to=length(corpus), by=1)){ if(opts$options$verbose == TRUE || i %% 50 == 0){ print(paste(i, " ", substr(corpus[[i]], 1, 140), sep = " ")) } DublinCore(corpus[[i]
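The loop is truncated, but the usual explanation for this pattern is that each corpus[[i]] <- ... assignment rewrites the corpus object, so the cost grows with every iteration. A sketch of a workaround that does the per-document work over a plain list and rebuilds the corpus once (toy data; "title" stands in for the real Dublin Core fields):

    library(tm)

    corpus <- VCorpus(VectorSource(c("first doc", "second doc", "third doc")))
    titles <- c("Doc A", "Doc B", "Doc C")  # illustrative metadata

    # Modify copies of the documents in a plain list, then rebuild in one step
    docs <- lapply(seq_along(corpus), function(i) {
      doc <- corpus[[i]]
      meta(doc, "title") <- titles[i]
      doc
    })
    corpus <- as.VCorpus(docs)

    # Corpus-level ("indexed") metadata can be assigned in one vectorized call
    meta(corpus, "title") <- titles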

How to convert corpus to data.frame with meta data in R

爱⌒轻易说出口 submitted on 2019-12-22 10:13:44
Question: How can I convert a corpus into a data frame in R that also contains the metadata? I already tried the suggestion from "convert corpus into data.frame in R", but the resulting data frame only contains the text lines from all docs in the corpus. I also need the document ID, and perhaps the line number of each text line, as two additional columns. So, how can I extend this command: dataframe <- data.frame(text = unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors = FALSE) to get that data? I already tried
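One way to extend that command and keep the document ID plus a per-line number, sketched against a toy VCorpus of plain-text documents:

    library(tm)

    mycorpus <- VCorpus(VectorSource(c("first doc", "second doc")))  # illustrative

    # One data-frame fragment per document: its ID, line numbers, and text lines
    rows <- lapply(mycorpus, function(doc) {
      lines <- as.character(doc)
      data.frame(id   = meta(doc, "id"),
                 line = seq_along(lines),
                 text = lines,
                 stringsAsFactors = FALSE)
    })
    dataframe <- do.call(rbind, rows)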

tm package error “Cannot convert DocumentTermMatrix into normal matrix since vector is too large”

泄露秘密 submitted on 2019-12-22 08:01:50
Question: I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB. > corp A corpus with 1859 text documents > mat<-DocumentTermMatrix(corp) > dim(mat) [1] 1859 25722 > is(mat) [1] "DocumentTermMatrix" > mat2<-as.matrix(mat) Fehler: kann Vektor der
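Rather than forcing a dense matrix through as.matrix(), a common workaround is to stay sparse: a DocumentTermMatrix is a slam simple triplet matrix underneath, and its slots map directly onto Matrix::sparseMatrix(). A sketch, where mat is the DocumentTermMatrix from the question:

    library(Matrix)

    # i/j/v hold the row index, column index, and count of each nonzero entry
    sparse <- sparseMatrix(i = mat$i, j = mat$j, x = mat$v,
                           dims = dim(mat), dimnames = dimnames(mat))

    # Many computations work on the sparse form without ever going dense
    col_totals <- colSums(sparse)

    # Alternatively, shrink the matrix first by dropping very sparse terms
    mat_small <- tm::removeSparseTerms(mat, 0.99)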

Compute ngrams for each row of text data in R

不羁岁月 submitted on 2019-12-21 21:43:24
Question: I have a data column of the following format: Text Hello world Hello How are you today I love stackoverflow blah blah blahdy I would like to compute the 3-grams for each row in this dataset, perhaps using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the n-grams for the entire column. How can I apply this function to each observation in my data separately? Answer 1: Is this what you're after? library("RWeka") library("tm") TrigramTokenizer <-
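The answer is cut off mid-definition, but since the question mentions tau::textcnt(), one per-row approach is simply to apply it over the rows. The row boundaries below are guessed from the flattened excerpt and are illustrative only; method = "string" counts word n-grams:

    library(tau)

    df <- data.frame(Text = c("Hello world",
                              "Hello How are you today",
                              "I love stackoverflow",
                              "blah blah blahdy"),
                     stringsAsFactors = FALSE)

    # textcnt() pools whatever it is given, so call it once per row;
    # each element of the result is named by the word 3-grams of that row
    trigrams <- lapply(df$Text, textcnt, n = 3L, method = "string")
    trigrams[[2]]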