tm

Error faced while using TM package's VCorpus in R

最后都变了- submitted on 2019-12-01 16:59:40
I am facing the error below while working with the TM package in R. library("tm") Loading required package: NLP Warning messages: 1: package ‘tm’ was built under R version 3.4.2 2: package ‘NLP’ was built under R version 3.4.1 corpus <- VCorpus(DataframeSource(data)) Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE I have tried several things, such as reinstalling the package and updating to a new version of R, but the error still persists. For the same data file, the same code runs on another system with the same version of R. Eva I met the same problem when I updated the tm package to
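The condition quoted in the error message suggests that the data frame handed to DataframeSource needs columns named doc_id and text. Here is a minimal sketch under that reading. The input column names `id` and `body` are made up for illustration, and the tm call is guarded so the block still runs where tm is not installed.

```r
# Hypothetical input: the column names `id` and `body` are assumptions for illustration.
data <- data.frame(
  id   = c("a", "b"),
  body = c("first document", "second document"),
  stringsAsFactors = FALSE
)

# Rename to the two names that appear in the error message.
names(data)[names(data) == "id"]   <- "doc_id"
names(data)[names(data) == "body"] <- "text"

# Same condition as in the error message; TRUE means the check would pass.
has_required <- all(!is.na(match(c("doc_id", "text"), names(data))))

# Only attempt the tm call when the package is available.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::DataframeSource(data))
}
```

Running the `has_required` line on your own data frame before building the corpus should show whether the column names are the culprit.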

Plot highly correlated words against a specific word of interest [closed]

房东的猫 submitted on 2019-12-01 13:37:37
Question: I am trying to plot the words most highly correlated with a given word. For example, I want to graph the ten highest correlations with the word "whale." Can someone help me with the command for something like that? I have RGraphViz installed if that helps. s.dir1<-"/PATHTOTEXT/MobyDickTxt" s.cor1<-Corpus(DirSource(s
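One way to read the request: correlate the counts of every other term with the counts of the word of interest, keep the ten largest, and plot them. The sketch below does that in base R on a toy count matrix. All values and term names other than "whale" are invented, and this is not the asker's code.

```r
# Toy document-term counts: rows are documents, columns are terms (all values invented).
m <- matrix(c(1, 1, 0,
              2, 2, 1,
              0, 0, 3,
              3, 3, 0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("whale", "ship", "land")))

word <- "whale"

# Correlation of every other term's counts with the word of interest.
cors <- cor(m[, word], m[, colnames(m) != word])[1, ]

# Keep at most the ten highest correlations and draw them as bars.
top <- head(sort(cors, decreasing = TRUE), 10)
barplot(top, las = 2, ylab = paste("correlation with", word))
```

With a real tm document-term matrix, tm's findAssocs() is, as far as I know, the packaged way to get these correlations without converting to a dense matrix first.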

Impossible to see results of `RTextTools::toLower()` text in Document-Term-Matrix

白昼怎懂夜的黑 submitted on 2019-12-01 13:12:19
Question: I am trying to create a matrix, and I would like to lowercase the text first. I use this R instruction: matrix = create_matrix(tweets[,1], toLower = TRUE, language="english", removeStopwords=FALSE, removeNumbers=TRUE, stemWords=TRUE) Here is the R code: library(RTextTools) library(e1071) pos_tweets = rbind( c('j AIME la voiture', 'positive'), c('cette machine est performante', 'positive'), c('je me sens en bonne forme ce matin', 'positive'), c('je suis super excitée d aller voir le spectacle de

Remove unicode <+f0b7> from Corpus text

我是研究僧i submitted on 2019-12-01 12:36:08
Question: I'm having a pretty stubborn issue... I can't seem to remove the <+f0b7> and <+f0a0> strings from corpora that I have loaded from *.txt files into R: UPDATE Here's a link to the sample .txt file: https://db.tt/qTRKpJYK Corpus(DirSource("./SomeDirectory/txt/"), readerControl = list(reader = readPlain)) title professional staff - contract - permanent position software c microfocus cobol unix btrieve ibm vm-cms vsam cics jcl accomplishments <+f0b7> <+f0a0> responsible maintaining billing system
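If the markers really appear in the text literally, as in the excerpt above, a regular expression can strip them. This is a sketch under that assumption only. In the raw file they may instead be actual Unicode characters, in which case the pattern would need to target those characters rather than the printed form.

```r
# Sample text modeled on the excerpt in the question.
x <- "jcl accomplishments <+f0b7> <+f0a0> responsible maintaining billing system"

# Replace every token of the form <+f0XX> (two hex digits after f0) with a space.
cleaned <- gsub("<\\+f0[0-9a-f]{2}>", " ", x)

# Collapse the runs of whitespace left behind.
cleaned <- gsub("\\s+", " ", trimws(cleaned))
```

On the sample string this leaves "jcl accomplishments responsible maintaining billing system".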

big document term matrix - error when counting the number of characters of documents

∥☆過路亽.° submitted on 2019-12-01 12:02:07
Question: I have built a big document-term matrix with the package RTextTools. Now I am trying to count the number of characters in the matrix rows so that I can remove empty documents before performing topic modeling. My code gives no errors when I apply it to a sample of my corpus, which yields a smaller matrix, but when I try to count the row length of the documents in the matrix produced from my entire corpus (~75000 tweets) I get the following error message: Error in vector(typeof(x$v), nr * nc) :
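Reading the error, vector(typeof(x$v), nr * nc) appears to be an attempt to allocate one dense vector with nr * nc cells, which would explain why the sample works and the full corpus does not. Per-row totals can be computed from the sparse triplet form directly. The sketch below uses a plain list as a stand-in for the sparse matrix. The field v comes from the error message, while i, j, nrow and ncol are my assumptions about the layout.

```r
# Toy sparse matrix in triplet form: value v[k] sits at row i[k], column j[k].
# Field names other than `v` are assumptions for illustration.
stm <- list(
  i = c(1L, 1L, 3L),
  j = c(1L, 2L, 2L),
  v = c(2, 1, 4),
  nrow = 3L, ncol = 2L
)

# Sum the stored values per row; rows with no stored entries stay at 0.
row_totals <- numeric(stm$nrow)
sums <- tapply(stm$v, stm$i, sum)
row_totals[as.integer(names(sums))] <- sums

# Rows whose total is 0 are the empty documents.
empty_rows <- which(row_totals == 0)
```

If the matrix is a tm DocumentTermMatrix, slam::row_sums() should give the same totals without densifying, though I have not checked it against the asker's data.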

How to calculate proximity of words to a specific term in a document

落花浮王杯 submitted on 2019-12-01 10:55:54
I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me somewhere helpful. So let's say I have the following text: song <- "Far over the misty mountains cold To dungeons deep and caverns old We must away ere break of day To seek the pale enchanted gold. The dwarves of yore made mighty spells, While hammers fell like ringing bells In places deep, where dark things sleep, In hollow halls beneath the fells. For
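One possible definition of proximity is the distance, counted in words, from each token to the nearest occurrence of the term, averaged per distinct word. The sketch below implements that definition on the first two lines of the song. The choice of "mountains" as the term and the nearest-occurrence rule are my assumptions, not something stated in the question.

```r
song <- "Far over the misty mountains cold To dungeons deep and caverns old"

# Lowercase, strip punctuation, split on whitespace.
words <- strsplit(tolower(gsub("[[:punct:]]", "", song)), "\\s+")[[1]]

target <- "mountains"  # hypothetical term of interest
target_pos <- which(words == target)

# For each token, distance in words to the nearest occurrence of the target.
dist_to_target <- sapply(seq_along(words), function(k) min(abs(k - target_pos)))

# Average proximity per distinct word, leaving out the target itself.
keep <- words != target
avg_proximity <- tapply(dist_to_target[keep], words[keep], mean)
```

Sorting avg_proximity in increasing order then lists the words that tend to sit closest to the term.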

How to calculate proximity of words to a specific term in a document

心已入冬 submitted on 2019-12-01 08:54:23
Question: I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me somewhere helpful. So let's say I have the following text: song <- "Far over the misty mountains cold To dungeons deep and caverns old We must away ere break of day To seek the pale enchanted gold. The dwarves of yore made mighty spells, While hammers

multiple results of one variable when applying tm method “stemCompletion”

回眸只為那壹抹淺笑 submitted on 2019-12-01 08:51:20
I have a corpus containing journal data of 15 observations of 3 variables (ID, title, abstract). Using R Studio I read in the data from a .csv file (one line per observation). When performing some text mining operations I ran into trouble with the method stemCompletion. After applying stemCompletion I observed that the results are provided three times for each stemmed line of the .csv. All the other tm methods (e.g. stemDocument) produce only a single result. I'm wondering why this happens and how I could fix the problem. I used the code below: data.corpus <- Corpus(DataframeSource(data))

In R tm package, build corpus FROM Document-Term-Matrix

断了今生、忘了曾经 submitted on 2019-12-01 06:35:30
It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to go the other way and build a corpus from a document-term matrix. Let M be the number of documents in a document set. Let V be the number of terms in the vocabulary of that document set. Then a document-term matrix is an M*V matrix. I also have a vocabulary vector, of length V. In the vocabulary vector are the words represented by indices in the document-term matrix. From the dtm and vocabulary vector, I'd like to construct a "corpus" object. This is because I'd like to stem my document set. I built my dtm and vocab
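Since a document-term matrix only stores counts, one workaround is to rebuild a bag-of-words string per document by repeating each vocabulary word as many times as its count. A minimal sketch with a toy 2 x 3 matrix follows. The words and counts are invented, and the original word order cannot be recovered this way.

```r
# Toy inputs: M = 2 documents, V = 3 terms (words and counts invented).
vocab <- c("whale", "sea", "ship")
dtm <- matrix(c(2, 0, 1,
                0, 1, 1), nrow = 2, byrow = TRUE)

# Repeat each vocabulary word by its count and paste into one string per document.
docs <- apply(dtm, 1, function(counts) {
  paste(rep(vocab, times = counts), collapse = " ")
})
```

The resulting character vector could then be wrapped with something like tm::VCorpus(tm::VectorSource(docs)) and stemmed as usual. That is fine for stemming and re-counting, but not for anything that depends on word order.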

multiple results of one variable when applying tm method “stemCompletion”

▼魔方 西西 submitted on 2019-12-01 06:08:24
Question: I have a corpus containing journal data of 15 observations of 3 variables (ID, title, abstract). Using R Studio I read in the data from a .csv file (one line per observation). When performing some text mining operations I ran into trouble with the method stemCompletion. After applying stemCompletion I observed that the results are provided three times for each stemmed line of the .csv. All the other tm methods (e.g. stemDocument) produce only a single result. I'm wondering why this