tm

Issue in DocumentTermMatrix with corpus in German

余生颓废 submitted on 2021-02-18 18:53:42
Question: I created a corpus in R using the tm package, specifying language and encoding as follows:

    de_DE.corpus <- Corpus(VectorSource(de_DE.sample), readerControl = list(language="de_DE",encoding = "UTF_8"))
    de_DE.corpus[36]$content
    de_DE.dtm <- DocumentTermMatrix(de_DE.corpus,control = list (encoding = 'UTF-8'))
    inspect(de_DE.dtm[, grepl("grÃ", de_DE.dtm$dimnames$Terms)])
    inspect(de_DE.dtm[36, ])

When I look at the content of document 36 via de_DE.corpus[36]$content, which contains 'ü', the text is shown correctly. e

Retaining unique identifiers (e.g., record id) when using tm functions - doesn't work with lots of data?

元气小坏坏 submitted on 2021-02-10 16:22:34
Question: I am working with unstructured text (Facebook) data and am pre-processing it (e.g., stripping punctuation, removing stop words, stemming). I need to retain the record (i.e., Facebook post) ids while pre-processing. I have a solution that works on a subset of the data but fails with all the data (N = 127K posts). I have tried chunking the data, and that doesn't work either. I think it has something to do with my using a workaround and relying on row names. For example, it appears to work

Quotes and hyphens not removed by tm package functions while cleaning corpus

左心房为你撑大大i submitted on 2021-02-07 12:39:26
Question: I'm trying to clean the corpus and I've used the typical steps, like the code below:

    docs<-Corpus(DirSource(path))
    docs<-tm_map(docs,content_transformer(tolower))
    docs<-tm_map(docs,content_transformer(removeNumbers))
    docs<-tm_map(docs,content_transformer(removePunctuation))
    docs<-tm_map(docs,removeWords,stopwords('en'))
    docs<-tm_map(docs,stripWhitespace)
    docs<-tm_map(docs,stemDocument)
    dtm<-DocumentTermMatrix(docs)

Yet when I inspect the matrix there are a few words that come with quotes, such
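The excerpt is cut off before the problem terms are listed, so the following is only a guess at the cause: typographic (curly) quotes and long dashes are not ASCII punctuation and may survive a default punctuation-removal step. A minimal base R sketch, assuming that is what is happening; the helper name strip_quotes_dashes is made up for illustration and is not part of tm:

```r
# Hypothetical helper (not part of tm): remove straight and curly quotes,
# and turn hyphens and long dashes into spaces.
strip_quotes_dashes <- function(x) {
  # \u2018 \u2019 = single curly quotes, \u201C \u201D = double curly quotes
  x <- gsub("[\u2018\u2019\u201C\u201D\"'`]", "", x, perl = TRUE)
  # \u2013 = en dash, \u2014 = em dash, plus the plain ASCII hyphen
  x <- gsub("[\u2013\u2014-]", " ", x, perl = TRUE)
  # Collapse the whitespace left behind
  gsub("\\s+", " ", trimws(x))
}

cleaned <- strip_quotes_dashes("\u201Cwell-known\u201D it\u2019s \u2014 fine")
```

Inside a tm pipeline such a function could be wrapped the same way the question wraps tolower, i.e. with content_transformer() inside tm_map().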

How to remove rows from a data frame that contain only few words in R?

梦想的初衷 submitted on 2021-02-05 07:55:10
Question: I'm trying to remove rows from my data frame that contain fewer than 5 words, e.g.

    mydf <- as.data.frame(read.xlsx("C:\\data.xlsx", 1, header=TRUE))
    head(mydf)
      NO ARTICLE
    1 34 The New York Times reports a lot of words here.
    2 12 Greenwire reports a lot of words.
    3 31 Only three words.
    4  2 The Financial Times reports a lot of words.
    5  9 Greenwire short.
    6 13 The New York Times reports a lot of words again.

I want to remove rows with 5 or fewer words. How can I do that?

Answer 1: Here are two ways:
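The answer text is cut off in this excerpt, so the following is not the original answer. It is one possible base R sketch: count whitespace-separated tokens per row and keep only rows above the threshold. The data frame is rebuilt by hand from the head(mydf) output in the question, and count_words is a helper name chosen for illustration.

```r
# Toy data modelled on the head(mydf) output shown in the question.
mydf <- data.frame(
  NO = c(34, 12, 31, 2, 9, 13),
  ARTICLE = c(
    "The New York Times reports a lot of words here.",
    "Greenwire reports a lot of words.",
    "Only three words.",
    "The Financial Times reports a lot of words.",
    "Greenwire short.",
    "The New York Times reports a lot of words again."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of whitespace-separated tokens per string.
count_words <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

mydf$n_words <- count_words(mydf$ARTICLE)

# Drop rows with 5 or fewer words, i.e. keep rows with more than 5.
kept <- mydf[mydf$n_words > 5, ]
```

Here a "word" is anything separated by whitespace, so punctuation attached to a token does not add to the count.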

The readTabular() function is gone in the newest version of tm. What do we use as its replacement?

耗尽温柔 submitted on 2020-07-30 05:52:50
Question: In tm version 0.7-1, there was a readTabular() function. Now it is gone, and if you try to use it, there is no deprecation message or warning or anything, like you might reasonably expect. It's just gone. In the distant past (4 days ago), it could be used like:

    library(tm)
    myReader <- tm::readTabular(mapping=list(id="id", content="content"))
    cor <- tm::VCorpus(tm::DataframeSource(dt), readerControl = list(reader = myReader))

So how do you do something like that in the newest version of tm , 0
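No answer is included in this excerpt. As a hedged sketch: in later tm releases, DataframeSource() expects the first two columns of the data frame to be named doc_id and text, and treats any further columns as document-level metadata, so the mapping that readTabular() used to do can be done by renaming columns beforehand. The data frame dt below is invented to mirror the id/content columns in the question.

```r
# Invented stand-in for the question's `dt` (columns "id" and "content").
dt <- data.frame(
  id = c("a1", "a2"),
  content = c("first text", "second text"),
  stringsAsFactors = FALSE
)

# Map the old column names onto the names DataframeSource() looks for.
df <- data.frame(doc_id = dt$id, text = dt$content, stringsAsFactors = FALSE)

# Build the corpus only if tm is installed, so the sketch runs without it.
if (requireNamespace("tm", quietly = TRUE)) {
  cor <- tm::VCorpus(tm::DataframeSource(df))
}
```

Because doc_id travels with each document, this layout may also help when record ids need to survive pre-processing.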

Hierarchical clustering using cosine distance in R

岁酱吖の submitted on 2020-07-19 11:07:37
Question: I want to do hierarchical clustering using cosine similarity in R for a corpus of documents, but I got the following error:

    Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") : missing value where TRUE/FALSE needed

What should I do? To reproduce it, here's an example:

    library(tm)
    doc <- c( "The sky is blue.", "The sun is bright today.", "The sun in the sky is bright.", "We can see the shining sun, the bright sun." )
    doc_corpus <- Corpus
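The reproduction code is truncated here, so the actual cause cannot be confirmed. One situation that may produce this message is handing hclust() something that is not a dist object (for example a plain similarity matrix). A base R sketch under that assumption, building the term counts by hand instead of with tm:

```r
doc <- c(
  "The sky is blue.",
  "The sun is bright today.",
  "The sun in the sky is bright.",
  "We can see the shining sun, the bright sun."
)

# Tokenise: lower-case, drop punctuation, split on whitespace.
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(doc)), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Dense document-term count matrix (rows = documents, columns = terms).
dtm <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
dimnames(dtm) <- list(paste0("doc", seq_along(doc)), vocab)

# Cosine similarity between rows, then cosine distance = 1 - similarity.
norms <- sqrt(rowSums(dtm^2))
cos_sim <- tcrossprod(dtm) / outer(norms, norms)

# hclust() needs a "dist" object, so convert the square matrix explicitly.
cos_dist <- as.dist(1 - cos_sim)
hc <- hclust(cos_dist, method = "average")
```

With a tm DocumentTermMatrix the same steps would apply after as.matrix(); the linkage method "average" is an arbitrary choice for the sketch.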

Mapping the topic of the review in R

╄→尐↘猪︶ㄣ submitted on 2020-07-18 07:59:24
Question: I have two data sets, Review Data and Topic Data.

dput output of my Review Data:

    structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", "Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA, -2L))

dput output of my Topic Data:

    structure(list(word = structure(2:1, .Label = c("canteen food", "sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")),
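The question is cut off before the desired output is shown, so the goal is assumed here to be: attach to each review the Topic whose word phrase occurs in it. A minimal base R sketch under that assumption, with the two data sets retyped as plain character columns and match_topic as an illustrative helper name:

```r
# The two data sets from the dput output, retyped without factors.
reviews <- data.frame(
  Review = c(
    "Sports and physical exercise need to be given importance",
    "Canteen Food could be improved"
  ),
  stringsAsFactors = FALSE
)
topics <- data.frame(
  word = c("sports and physical", "canteen food"),
  Topic = c("Sports ", "Canteen"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: topics whose phrase appears in the lower-cased review.
match_topic <- function(review, topics) {
  hit <- vapply(
    topics$word,
    function(w) grepl(w, tolower(review), fixed = TRUE),
    logical(1)
  )
  found <- trimws(topics$Topic[hit])  # "Sports " has a trailing space
  if (length(found) == 0) NA_character_ else paste(found, collapse = ", ")
}

reviews$Topic <- vapply(
  reviews$Review, match_topic, character(1),
  topics = topics, USE.NAMES = FALSE
)
```

Reviews that match no phrase get NA, and reviews that match several get a comma-separated list.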

Replace range of numbers with single numbers in a character string

陌路散爱 submitted on 2020-05-29 05:52:55
Question: Is there any way to replace a range of numbers with the single numbers in a character string? The numbers can range from n to n, most probably within 1-15; 4-10 is also possible. The range could be indicated

a) with a hyphen:

    a <- "I would like to buy 1-3 cats"

b) with a word, for example: to, bis, jusqu'à

    b <- "I would like to buy 1 jusqu'à 3 cats"

The result should look like

    "I would like to buy 1,2,3 cats"

I found this: Replace range of numbers with certain number, but could not really use it in R.

Answer 1:
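The answer is missing from this excerpt, so what follows is not the original answer, only one possible base R sketch. It finds every "number, separator, number" pair with gregexpr() and swaps it for the comma-separated sequence via `regmatches<-`. The function name expand_ranges and the separator list are assumptions taken from the examples in the question.

```r
# Hypothetical helper: expand "1-3" or "1 bis 3" into "1,2,3".
# The default separators are the ones named in the question.
expand_ranges <- function(x, sep = "\\s*(?:-|to|bis|jusqu'\u00e0)\\s*") {
  pattern <- paste0("(\\d+)", sep, "(\\d+)")
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    vapply(hits, function(h) {
      # Pull the two end points out of the matched range.
      ends <- as.integer(regmatches(h, gregexpr("\\d+", h))[[1]])
      paste(seq(ends[1], ends[2]), collapse = ",")
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

a <- "I would like to buy 1-3 cats"
b <- "I would like to buy 1 jusqu'\u00e0 3 cats"
res_a <- expand_ranges(a)
res_b <- expand_ranges(b)
```

A descending range such as 3-1 would come out as 3,2,1, because seq() counts down when the first end point is larger.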
