tm | 易学教程

Document-Term-Matrix of tm Package in R

Question: I am using the document-term matrix from the tm package in R, and I get this error:

    Doc <- DocumentTermMatrix(Data)
    Error in UseMethod("TermDocumentMatrix", x) : no applicable method for 'TermDocumentMatrix' applied to an object of class "table"

I tried a data frame, a data table, a matrix and a table, but I get the same error every time. What should I do?

Answer 1: You missed a step: you have to create a "corpus" first.

    library("tm")
    txt <- c("some text", "here in this", "vector as
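The answer excerpt above is cut off. As a minimal sketch of the step it describes (building a corpus before calling DocumentTermMatrix), assuming the text is available as a plain character vector; the sample strings here are hypothetical and are not the asker's data:

```r
library(tm)

# Hypothetical sample text standing in for the asker's data
txt <- c("some text here", "more text in this vector", "a third short document")

# DocumentTermMatrix() dispatches on corpus objects, not on tables or data frames,
# so wrap the character vector in a source and build a corpus first
corpus <- VCorpus(VectorSource(txt))
dtm <- DocumentTermMatrix(corpus)

# One row per document
dim(dtm)
```

If the text lives in a data frame column, passing that column (as a character vector) to VectorSource() follows the same pattern.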

How to print textual representation of single documents stored in a tm corpus in R?

Question: I was using the {tm} package and generated a corpus with

    corpus = Corpus(VectorSource(sample.words))

Then I wanted to check the content of the corpus, but it prints this instead of its texts:

    > corpus
    <<VCorpus>>
    Metadata: corpus specific: 0, document level (indexed): 0
    Content: documents: 3933

I have since found some methods to look into the corpus, but I started wondering what exactly R prints when an object's name is typed.

    > class(corpus)
    [1] "VCorpus" "Corpus"
    > typeof(corpus)
    [1] "list"

Why it didn
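For illustration, one way to look at the text of individual documents is sketched below. It assumes a VCorpus, as in the printed output above; the contents of sample.words are hypothetical. Typing an object's name calls its print method, which for a corpus shows only a summary, so the text has to be extracted explicitly:

```r
library(tm)

# Hypothetical stand-in for the asker's sample.words
sample.words <- c("first document", "second document", "third one")
corpus <- VCorpus(VectorSource(sample.words))

# Auto-printing shows the summary only
print(corpus)

# Text of a single document
first_text <- as.character(corpus[[1]])
writeLines(first_text)

# Text of every document as a character vector
all_text <- sapply(corpus, as.character)
```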

DocumentTermMatrix() return 0 terms in tm package

Question: I have an object like this:

    str(apps)
    chr [1:17517] "35 44 33 40 33 40 44 38 33 37 37" ...

In each row, the numbers are separated by spaces.

    corpus<-Corpus(VectorSource(apps))
    dtm<-DocumentTermMatrix(corpus)
    str(dtm)
    List of 6
     $ i : int(0)
     $ j : int(0)
     $ v : num(0)
     $ nrow : int 17517
     $ ncol : int 0
     $ dimnames:List of 2
     ..$ Docs : chr [1:17517] "1" "2" "3" "4" ...
     ..$ Terms: NULL
     - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
     - attr(*, "weighting")= chr [1:2] "term
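No answer is included in this excerpt. A likely explanation, offered here as an assumption rather than as the accepted answer, is tm's default wordLengths control of c(3, Inf), which discards tokens shorter than three characters, so every two-digit token is dropped. A minimal sketch with a hypothetical stand-in for apps:

```r
library(tm)

# Hypothetical stand-in for the asker's `apps` vector
apps <- c("35 44 33 40", "33 40 44 38", "33 37 37")
corpus <- VCorpus(VectorSource(apps))

# Default control: tokens shorter than 3 characters are discarded
dtm_default <- DocumentTermMatrix(corpus)

# Lower the minimum word length so two-character tokens are kept
dtm_short <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))

ncol(dtm_default)
ncol(dtm_short)
```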

Assigning weights to different features in R

Question: Is it possible to assign weights to different features before constructing a DFM in R? Consider this example:

    str="apple is better than banana"
    mydfm=dfm(str, ignoredFeatures = stopwords("english"), verbose = FALSE)

The DFM mydfm looks like:

    docs  apple better banana
    text1     1      1      1

But I want to assign weights (apple: 5, banana: 3) beforehand, so that mydfm looks like:

    docs  apple better banana
    text1     5      1      3

Answer 1: I don't think so; however, you can easily do it afterwards:

    library(quanteda)
    str <-

How is the correct use of stemDocument?

Question: I have already read this question and this one, but I still don't understand how to use stemDocument in tm_map. Consider this example:

    q17 <- VCorpus(VectorSource(x = c("poder", "pode")), readerControl = list(language = "pt", load = TRUE))
    lapply(q17, content)
    $`character(0)`
    [1] "poder"

    $`character(0)`
    [1] "pode"

If I use:

    > stemDocument("poder", language = "portuguese")
    [1] "pod"
    > stemDocument("pode", language = "portuguese")
    [1] "pod"

it does work! But if I use:

    > q17 <- tm_map(q17,

R Warning in stemCompletion and error in TermDocumentMatrix

Question: I followed the instructions from here. Slide no. 9 uses tolower, which has an issue in tm 0.6 and above, so I used

    myCorpus <- tm_map(myCorpus, content_transformer(tolower))

as suggested in this duplicate Stack Overflow question, but I still get an error when I run stemCompletion:

    myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)

I also followed this instruction to convert both myCorpus and myCorpusCopy to PlainTextDocument:

    corpus <- tm_map(corpus, PlainTextDocument)

I was able to execute

R: tm package reading in Newsgroups data

Question: The lines of code below return the following error: object 'readNewsgroup' not found

    library(tm)
    setwd("C:/Users/DanRoDuq/Downloads/20news-bydate-train")
    sci.electr.train=Corpus(DirSource("sci.electronics") ,readerControl=list(reader=readNewsgroup,language="en_US"))

I got the data from http://qwone.com/~jason/20Newsgroups/lexData.text and downloaded the file called 20news-bydate.tar.gz. When I replace readNewsgroup with readPlain, the code runs, but the instructions I'm following tell me to

R: inspect Document Term Matrix results in Error: Repeated indices currently not allowed

Question: I have the following dummy data:

    final6 <- data.frame(docname = paste0("doc", 1:6),
                         articles = c("Catalonia independence in matter of days",
                                      "Anger over Johnson Libya bodies comment",
                                      "Man admits frenzied mum and son murder",
                                      "The headache that changed my life",
                                      "Las Vegas killer sick, demented - Trump",
                                      "Instagram baby photo scammer banned"))

I want to create a DocumentTermMatrix that keeps a reference to the document names, so that I can later link it back to the original article text. To achieve this, I

Loop through a tm corpus without losing corpus structure

Question: I have a tm corpus of documents and a list of words. I want to run a for loop over the corpus so that the loop removes each word in the list from the corpus sequentially. Some replication data:

    library(tm)
    m <- cbind(c("Apple blue two","Pear yellow five","Banana yellow two"), c(1, 2, 3))
    tm_corpus <- Corpus(VectorSource(m[,1]))
    words <- as.list(c("Apple", "yellow", "two"))

tm_corpus is now a corpus object consisting of 3 documents:

    <<SimpleCorpus>>
    Metadata: corpus specific: 1, document
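The excerpt ends before any answer. One possible approach, sketched here as an assumption and not taken from the original thread, is to reassign the result of tm_map on each iteration so the object stays a corpus. The sketch uses VCorpus instead of the asker's Corpus() call to keep the example independent of SimpleCorpus behaviour:

```r
library(tm)

m <- cbind(c("Apple blue two", "Pear yellow five", "Banana yellow two"), c(1, 2, 3))
# Assumption: VCorpus is used here in place of Corpus() from the question
tm_corpus <- VCorpus(VectorSource(m[, 1]))
words <- as.list(c("Apple", "yellow", "two"))

# Remove one word per iteration; tm_map returns a corpus each time,
# so the corpus structure is preserved across the loop
for (w in words) {
  tm_corpus <- tm_map(tm_corpus, removeWords, w)
}

texts <- sapply(tm_corpus, as.character)
```

Since removeWords accepts a character vector, a single call such as tm_map(tm_corpus, removeWords, unlist(words)) would likely give the same result without a loop.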

Keeping Turkish characters with the text mining package for R

Question: Let me start by saying that I'm still pretty much a beginner with R. I am currently trying out basic text mining techniques for Turkish texts using the tm package. However, I have run into a problem with the display of Turkish characters in R. Here's what I did:

    docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
    writeLines(as.character(docs), con="documents.txt")

My thinking was that setting the language to Turkish and the encoding