tm

DocumentTermMatrix counts incorrectly when using a dictionary

家住魔仙堡 Submitted on 2019-12-06 08:38:43
I am trying to do sentiment analysis on Twitter data using the naive Bayes algorithm, working with 2000 tweets. After loading the data into RStudio I split and preprocess it as follows:

train_size = floor(0.75 * nrow(Tweets_Model_Input))
set.seed(123)
train_sub = sample(seq_len(nrow(Tweets_Model_Input)), size = train_size)
Tweets_Model_Input_Train = Tweets_Model_Input[train_sub, ]
Tweets_Model_Input_Test = Tweets_Model_Input[-train_sub, ]
myCorpus = Corpus(VectorSource(Tweets_Model_Input_Train$SentimentText))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
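A minimal sketch of the usual next step for a naive Bayes setup, assuming the goal is to build train and test document-term matrices over a shared vocabulary; the names myCorpus_test and freq_terms and the frequency threshold are illustrative, not from the original post:

library(tm)

# preprocess the test tweets the same way as the training tweets
myCorpus_test <- Corpus(VectorSource(Tweets_Model_Input_Test$SentimentText))
myCorpus_test <- tm_map(myCorpus_test, removeWords, stopwords("english"))

# restrict both matrices to frequent training terms so their columns line up
freq_terms <- findFreqTerms(DocumentTermMatrix(myCorpus), 5)
dtm_train  <- DocumentTermMatrix(myCorpus,      list(dictionary = freq_terms))
dtm_test   <- DocumentTermMatrix(myCorpus_test, list(dictionary = freq_terms))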

Trying to remove words from a DocumentTermMatrix in order to use topicmodels

自作多情 Submitted on 2019-12-06 05:56:15
Question: I am trying to use the topicmodels package for R (100 topics on a corpus of ~6400 documents, each ~1000 words long). The process runs and then dies, I think because it runs out of memory. So I am trying to shrink the document-term matrix that the LDA() function takes as input; I figure I can do that using the minDocFreq option when I generate my document-term matrices, but when I use it, it doesn't seem to make any difference. Here is the relevant code:
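In current versions of tm, the document-frequency cutoff is specified through the bounds control of DocumentTermMatrix rather than a minDocFreq argument, so passing minDocFreq may simply be ignored. A sketch under that assumption; the corpus name, the threshold, and the use of slam are illustrative:

library(tm)
library(topicmodels)
library(slam)

# keep only terms appearing in at least 5 documents, shrinking the matrix
dtm <- DocumentTermMatrix(my_corpus,
                          control = list(bounds = list(global = c(5, Inf))))

# drop documents that became empty after trimming, then fit the model
dtm <- dtm[row_sums(dtm) > 0, ]
lda_fit <- LDA(dtm, k = 100)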

Counting n-grams with the tm package in R

冷暖自知 Submitted on 2019-12-06 05:12:05
Question: I created a script that counts word frequencies in a document using a DocumentTermMatrix object and a dictionary in R. The script works on individual words but not on compound terms, e.g. "foo", "bar", "foo bar". This is the code:

require(tm)
my.docs <- c("foo bar word1 word2")
myCorpus <- Corpus(VectorSource(my.docs))
inspect(DocumentTermMatrix(myCorpus, list(dictionary = c("foo", "bar", "foo bar"))))

But the result is:

      Terms
Docs   bar foo foo bar
  1      1   1       0

I expected "foo bar" to be counted as 1. How can I get the compound term counted?
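One common way to get multi-word dictionary terms counted is to tokenize into unigrams and bigrams before building the matrix. A sketch using NLP::ngrams; a VCorpus is used here because, in recent tm versions, the default SimpleCorpus ignores custom tokenizers:

library(tm)
library(NLP)

# tokenizer that emits both unigrams and bigrams
UnigramBigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 1:2), paste, collapse = " "), use.names = FALSE)

myCorpus <- VCorpus(VectorSource("foo bar word1 word2"))
dtm <- DocumentTermMatrix(myCorpus,
                          control = list(tokenize = UnigramBigramTokenizer,
                                         dictionary = c("foo", "bar", "foo bar")))
inspect(dtm)  # "foo bar" is now counted as 1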

How to load packages in R

无人久伴 Submitted on 2019-12-06 03:51:16
Question: I have successfully installed the tm package, which is located in C:\Users\JustinLiang\Documents\R\win-library\3.0. After typing library(), it shows me the list of available R packages:

Packages in library ‘C:/Users/JustinLiang/Documents/R/win-library/3.0’:
  tm    Text Mining Package
Packages in library ‘C:/Program Files/R/R-3.0.2/library’:

However, when I try to load the package with library(tm), I get an error:

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is
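The error message is cut off above, but a loadNamespace() failure when calling library(tm) usually means one of tm's dependencies (for example NLP or slam) is missing from the library paths, not tm itself. A sketch of the usual fix, under that assumption:

# reinstall tm together with everything it depends on
install.packages("tm", dependencies = TRUE)

# or install the commonly missing dependencies directly
install.packages(c("NLP", "slam"))

library(tm)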

How to convert corpus to data.frame with meta data in R

夙愿已清 Submitted on 2019-12-06 00:36:29
How can I convert a corpus into a data frame in R that also contains metadata? I already tried the suggestion from "convert corpus into data.frame in R", but the resulting data frame only contains the text lines from all documents in the corpus. I also need the document ID, and maybe the line number of each text line, in two additional columns. So, how can I extend this command to get that data?

dataframe <- data.frame(text = unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors = FALSE)

I already tried:

dataframe <- data.frame(id = sapply(corpus, meta(corpus, "id")), text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)
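A sketch of one way to build the id/text data frame, assuming a standard VCorpus of PlainTextDocuments, with meta() called per document rather than on the whole corpus; the second block adds a line-number column by expanding each document into one row per line:

library(tm)

dataframe <- data.frame(
  id   = sapply(mycorpus, function(doc) meta(doc, "id")),
  text = sapply(mycorpus, function(doc) paste(content(doc), collapse = "\n")),
  stringsAsFactors = FALSE
)

# one row per text line, with the document id and line number
lines_df <- do.call(rbind, lapply(mycorpus, function(doc) {
  txt <- unlist(strsplit(paste(content(doc), collapse = "\n"), "\n"))
  data.frame(id = meta(doc, "id"), line = seq_along(txt), text = txt,
             stringsAsFactors = FALSE)
}))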

DocumentTermMatrix fails with a strange error only when # terms > 3000

岁酱吖の Submitted on 2019-12-06 00:32:34
Question: My code below works fine unless I create a DocumentTermMatrix with more than 3000 terms. This code:

movie_dict <- findFreqTerms(movie_dtm_train, 8)
movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))

fails with:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
  'i, j, v' different lengths
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
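The mclapply warning suggests the real error is raised inside tm's parallel workers and then swallowed, so only the inconsistent-lengths symptom is visible. A common debugging step (a sketch, not the confirmed fix for this particular case) is to force serial execution so the underlying error is reported directly:

# run tm's term-frequency computation on a single core
options(mc.cores = 1)
movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train,
                                           list(dictionary = movie_dict))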

Big Text Corpus breaks tm_map

老子叫甜甜 Submitted on 2019-12-06 00:30:26
Question: I have been breaking my head over this one for the last few days. I searched the SO archives and tried the suggested solutions, but I just can't get this to work. I have sets of txt documents in folders such as "2000 06", "1995-99", etc., and I want to run some basic text mining operations such as creating a document-term matrix and a term-document matrix and doing some operations based on co-locations of words. My script works on a smaller corpus; however, when I try it with the bigger corpus, tm_map breaks.
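Two frequent causes of tm_map failing only on larger corpora are plain string functions being applied without content_transformer() and errors being hidden by parallel workers. A sketch of a more defensive pipeline; the folder path and the particular transformations are illustrative:

library(tm)

options(mc.cores = 1)  # serial execution so any real error is reported

corpus <- VCorpus(DirSource("C:/TextMining/2000 06", pattern = "*.txt",
                            encoding = "UTF-8"),
                  readerControl = list(language = "en"))

# wrap plain string functions in content_transformer so documents stay valid
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)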

Removing rows from Corpus with multiple documents

被刻印的时光 ゝ Submitted on 2019-12-05 21:42:49
I have 4000 text documents in a corpus. As part of data clean-up, I want to remove from each document the row(s) (lines) that contain a specific word. For example:

library(tm)
doc.corpus <- VCorpus(DirSource("C:\\TextMining\\Prototype", pattern = "*.txt", encoding = "UTF8", mode = "text"), readerControl = list(language = "en"))
doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
doc.corpus[[1]]
# PlainTextDocument
# Metadata: 7
# Content: chars: 16542
as.character(doc.corpus)[[1]]
# $content
# "Quick to deploy, easy to use, and offering complete investment protection, our product is clearly differentiated from all competitive
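A sketch of one way to drop every line containing a given word from each document, using a content_transformer; the pattern "confidential" is a placeholder, not a word from the original question:

library(tm)

# remove each line of a document whose text matches `pattern`
remove_lines_with <- function(pattern) {
  content_transformer(function(x) {
    lines <- unlist(strsplit(paste(as.character(x), collapse = "\n"), "\n"))
    paste(lines[!grepl(pattern, lines)], collapse = "\n")
  })
}

doc.corpus <- tm_map(doc.corpus, remove_lines_with("confidential"))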

How to read text in a table from a CSV file

此生再无相见时 Submitted on 2019-12-05 15:06:44
I am new to the tm package. I want to read a CSV file, which contains one column with 2000 texts and a second column with a yes/no factor variable, into a Corpus. My intention is to convert the text into a matrix and use the factor variable as the target for prediction. I would also need to divide the corpus into train and test sets. I read several documents such as tm.pdf and found the documentation relatively limited. This is my attempt, following another thread on the same subject:

TexTest <- read.csv("C:/Test.csv")
m <- list(Text = "Text", Classification = "Classification")
corpus1 <- Corpus(x
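A sketch of one way to do this with a plain VectorSource, which is usually simpler than a tabular reader; the column names Text and Classification are taken from the mapping in the question, and the split proportion is illustrative:

library(tm)

TexTest <- read.csv("C:/Test.csv", stringsAsFactors = FALSE)

corpus1 <- VCorpus(VectorSource(TexTest$Text))
dtm     <- as.matrix(DocumentTermMatrix(corpus1))
target  <- factor(TexTest$Classification)

# 75/25 train/test split over the rows of the matrix
set.seed(123)
train_idx <- sample(seq_len(nrow(dtm)), size = floor(0.75 * nrow(dtm)))
dtm_train <- dtm[train_idx, ];  target_train <- target[train_idx]
dtm_test  <- dtm[-train_idx, ]; target_test  <- target[-train_idx]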

NLP process for combining common collocations

只愿长相守 Submitted on 2019-12-05 12:06:46
I have a corpus that I'm processing with the tm package in R (and I'm also mirroring the same script in NLTK in Python). I'm working with unigrams, but I would like a parser of some kind to combine commonly co-located words so that they are treated as one word; i.e., I'd like to stop seeing "New" and "York" separately in my data set when they occur together, and instead see that pair represented as "New York", as if it were a single word, alongside other unigrams. What is this process called, of putting meaningful, common n-grams on the same footing as unigrams? Is it not a thing? Finally, what would the tm approach be?
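This is usually called collocation (or multiword expression) extraction. Within tm, one simple approach is to rewrite known collocations as single tokens before tokenization; a sketch, where the list of collocations is illustrative:

library(tm)

# join each known collocation with an underscore so it survives tokenization
merge_collocations <- content_transformer(function(x, pairs) {
  for (p in pairs) x <- gsub(p, gsub(" ", "_", p, fixed = TRUE), x, fixed = TRUE)
  x
})

corpus <- tm_map(corpus, merge_collocations, c("New York", "Los Angeles"))
dtm <- DocumentTermMatrix(corpus)  # now contains the single term "new_york"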