tm | 易学教程

Issue in DocumentTermMatrix with corpus in German

Question: I created a corpus in R using the tm package, specifying language and encoding as follows:

    de_DE.corpus <- Corpus(VectorSource(de_DE.sample), readerControl = list(language="de_DE",encoding = "UTF_8"))
    de_DE.corpus[36]$content
    de_DE.dtm <- DocumentTermMatrix(de_DE.corpus,control = list (encoding = 'UTF-8'))
    inspect(de_DE.dtm[, grepl("grÃ", de_DE.dtm$dimnames$Terms)])
    inspect(de_DE.dtm[36, ])

When I look at de_DE.corpus[36]$content, the content of document 36, which contains 'ü', the text is shown correctly. e

Retaining unique identifiers (e.g., record id) when using tm functions - doesn't work with lots of data?

Question: I am working with unstructured text (Facebook) data and am pre-processing it (e.g., stripping punctuation, removing stop words, stemming). I need to retain the record (i.e., Facebook post) ids while pre-processing. I have a solution that works on a subset of the data but fails on the full data set (N = 127K posts). I have tried chunking the data, and that doesn't work either. I think it has something to do with my using a workaround and relying on row names. For example, it appears to work

Quotes and hyphens not removed by tm package functions while cleaning corpus

Question: I'm trying to clean the corpus and I've used the typical steps, as in the code below:

    docs<-Corpus(DirSource(path))
    docs<-tm_map(docs,content_transformer(tolower))
    docs<-tm_map(docs,content_transformer(removeNumbers))
    docs<-tm_map(docs,content_transformer(removePunctuation))
    docs<-tm_map(docs,removeWords,stopwords('en'))
    docs<-tm_map(docs,stripWhitespace)
    docs<-tm_map(docs,stemDocument)
    dtm<-DocumentTermMatrix(docs)

Yet when I inspect the matrix, there are a few words that come with quotes, such

How to remove rows from a data frame that contain only a few words in R?

Question: I'm trying to remove rows from my data frame that contain fewer than 5 words, e.g.

    mydf <- as.data.frame(read.xlsx("C:\\data.xlsx", 1, header=TRUE))
    head(mydf)
      NO ARTICLE
    1 34 The New York Times reports a lot of words here.
    2 12 Greenwire reports a lot of words.
    3 31 Only three words.
    4  2 The Financial Times reports a lot of words.
    5  9 Greenwire short.
    6 13 The New York Times reports a lot of words again.

I want to remove rows with 5 or fewer words. How can I do that?

Answer 1: Here are two ways:
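One way to do this in base R can be sketched as follows. This is only an illustration, not necessarily what the answer proposed: it assumes that words are separated by whitespace, and the data frame is rebuilt inline from the sample rows above rather than read from the Excel file.

```r
# Sample data rebuilt from the rows shown in the question (assumption:
# the real data would come from read.xlsx instead).
mydf <- data.frame(
  NO = c(34, 12, 31, 2, 9, 13),
  ARTICLE = c(
    "The New York Times reports a lot of words here.",
    "Greenwire reports a lot of words.",
    "Only three words.",
    "The Financial Times reports a lot of words.",
    "Greenwire short.",
    "The New York Times reports a lot of words again."
  ),
  stringsAsFactors = FALSE
)

# Count whitespace-separated tokens in each row.
word_count <- lengths(strsplit(trimws(mydf$ARTICLE), "\\s+"))

# Keep only rows with more than 5 words.
filtered <- mydf[word_count > 5, ]
print(filtered$NO)  # 34 12 2 13
```

If the ARTICLE column is a factor, as it may be after reading from a spreadsheet, converting it with as.character() first would be needed for strsplit() to work.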

The readTabular() function is gone in the newest version of tm. What do we use as its replacement?

Question: In tm version 0.7-1, there was a readTabular() function. Now it is gone, and if you try to use it, there is no deprecation message or warning of any kind, as you might reasonably expect. It's just gone. In the distant past (4 days ago), it could be used like this:

    library(tm)
    myReader <- tm::readTabular(mapping=list(id="id", content="content"))
    cor <- tm::VCorpus(tm::DataframeSource(dt), readerControl = list(reader = myReader))

So how do you do something like that in the newest version of tm, 0

Hierarchical clustering using cosine distance in R

Question: I want to do hierarchical clustering using cosine similarity in R for a corpus of documents, but I got the following error:

    Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
      missing value where TRUE/FALSE needed

What should I do? To reproduce it, here's an example:

    library(tm)
    doc <- c( "The sky is blue.", "The sun is bright today.", "The sun in the sky is bright.", "We can see the shining sun, the bright sun." )
    doc_corpus <- Corpus

Mapping the topic of the review in R

Question: I have two data sets, Review Data and Topic Data.

dput output of my Review Data:

    structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", "Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA, -2L))

dput output of my Topic Data:

    structure(list(word = structure(2:1, .Label = c("canteen food", "sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")),

replace range of numbers with single numbers in a character string

Question: Is there any way to replace a range of numbers with the individual numbers in a character string? The range can be any n-n, most probably around 1-15; 4-10 is also possible. The range could be indicated a) with a hyphen:

    a <- "I would like to buy 1-3 cats"

or b) with a word, for example: to, bis, jusqu'à

    b <- "I would like to buy 1 jusqu'à 3 cats"

The result should look like "I would like to buy 1,2,3 cats". I found this: Replace range of numbers with certain number, but could not really use it in R.

Answer 1:
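A minimal base R sketch of one possible approach is given below. It is not the original answer. The function name expand_ranges and its separators argument are hypothetical, and the separator words are taken from the examples in the question.

```r
# Hypothetical helper: expands "n <sep> m" into "n,n+1,...,m".
# `separators` lists the strings assumed to mark a range.
expand_ranges <- function(x, separators = c("-", "to", "bis", "jusqu'à")) {
  sep_pattern <- paste(separators, collapse = "|")
  # Two integers with one of the separators (and optional spaces) between them.
  pattern <- paste0("(\\d+)\\s*(?:", sep_pattern, ")\\s*(\\d+)")
  matches <- gregexpr(pattern, x, perl = TRUE)
  # Replace every matched range with its comma-separated expansion.
  regmatches(x, matches) <- lapply(regmatches(x, matches), function(found) {
    vapply(found, function(m) {
      bounds <- as.integer(regmatches(m, gregexpr("\\d+", m))[[1]])
      paste(seq(bounds[1], bounds[2]), collapse = ",")
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

a <- "I would like to buy 1-3 cats"
print(expand_ranges(a))  # "I would like to buy 1,2,3 cats"
```

Note that a bare hyphen between two numbers is always treated as a range here, so strings such as year spans or phone numbers would be expanded too; a stricter pattern would be needed if the text contains those.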
