quanteda

R: initialise empty dgCMatrix given by matrix multiplication of two Quanteda DFM sparse matrices?

Submitted by 隐身守侯 on 2019-12-10 11:57:03
Question: I have a for loop like this, trying to implement the solution here with dummy variables:

    aaa <- DFM %*% t(DFM)  # DFM is a quanteda dfm sparse matrix
    for (i in 1:nrow(aaa)) aaa[i, ] <- aaa[i, ][order(aaa[i, ], decreasing = TRUE)]

but now

    for (i in 1:nrow(mmm)) mmm[i, ] <- aaa[i, ][order(aaa[i, ], decreasing = TRUE)]

where mmm does not exist yet; the goal is to do the same thing as mmm <- t(apply(aaa, 1, sort, decreasing = TRUE)). But before the for loop I need to initialise mmm, otherwise: Error: …
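A minimal sketch of two ways to initialise mmm (my own suggestion, not taken from the question): copy aaa, which gives an independent object of the right class and dimensions thanks to R's copy-on-modify, or build an all-zero dgCMatrix with Matrix::sparseMatrix(). Note that row-sorted output is typically dense, so a plain matrix is often the better container.

    library(Matrix)
    # Option 1: start from a copy; same class, dimensions and dimnames as aaa
    mmm <- aaa
    # Option 2: an empty (all-zero) dgCMatrix of matching shape
    mmm <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0), dims = dim(aaa))
    # Fill row by row (slow for large dgCMatrix objects; assigning into a dense
    # matrix(0, nrow(aaa), ncol(aaa)) is usually much faster)
    for (i in 1:nrow(aaa)) mmm[i, ] <- sort(aaa[i, ], decreasing = TRUE)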

R: sparse matrix multiplication with data.table and quanteda package?

Submitted by 给你一囗甜甜゛ on 2019-12-10 00:52:24
Question: I am trying to do matrix multiplication with a sparse matrix from the quanteda package, utilising the data.table package, related to this thread here. So:

    require(quanteda)
    mytext <- c("Let the big dogs hunt", "No holds barred", "My child is an honor student")
    myMatrix <- dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
    # a data.table
    as.matrix(myMatrix) %*% transpose(as.matrix(myMatrix))

How can you get the matrix multiplication working here with the quanteda package and …
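A minimal sketch of the direct route (my suggestion, not from the thread): a dfm already is a sparse Matrix object, so %*% and t() dispatch to the Matrix package; there is no need to densify with as.matrix(), and data.table's transpose() is meant for lists, not matrices.

    library(quanteda)
    mytext <- c("Let the big dogs hunt", "No holds barred", "My child is an honor student")
    # Recent quanteda versions replaced dfm(..., ignoredFeatures =, stem =) with a
    # tokens() pipeline; adjust to your installed version.
    dfmat <- dfm_wordstem(dfm_remove(dfm(tokens(mytext)), stopwords("english")))
    dfmat %*% t(dfmat)  # document-by-document product, stays sparse throughout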

A lemmatizing function using a hash dictionary does not work with tm package in R

Submitted by 佐手、 on 2019-12-09 23:45:13
Question: I would like to lemmatize Polish text using a large external dictionary (in the format of the txt variable below). Unfortunately, Polish is not an available option in the popular text-mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts (I have also removed Polish diacritics from both the dictionary and the corpus). Unfortunately, the function does not work with the corpus format generated by tm.
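A minimal sketch of how I would bridge that gap (lemmatize_text() is a hypothetical stand-in for the vector-in/vector-out function from the linked answer): tm can apply any such string function document-by-document via content_transformer().

    library(tm)
    # hypothetical stand-in for the hash-dictionary lemmatizer from the linked answer
    lemmatize_text <- function(x) x
    corp <- VCorpus(VectorSource(c("ala ma kota", "kot ma ale")))
    corp <- tm_map(corp, content_transformer(lemmatize_text))
    content(corp[[1]])  # inspect the lemmatized first document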

Remove ngrams with leading and trailing stopwords

Submitted by ↘锁芯ラ on 2019-12-07 16:29:32
Question: I want to identify major n-grams in a bunch of academic papers, including n-grams with nested stopwords but not n-grams with leading or trailing stopwords. I have about 100 PDF files. I converted them to plain-text files through an Adobe batch command and collected them in a single directory. From there I use R. (It's a patchwork of code because I'm just getting started with text mining.) My code:

    library(tm)
    # Make path for sub-dir which contains corpus files
    path <- file.path(getwd(), …
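A minimal sketch of one filtering approach (my own, not the asker's patchwork): build the n-grams with quanteda, then drop any whose first or last token is a stopword; n-grams with only nested stopwords pass by construction.

    library(quanteda)
    toks <- tokens("the quick brown fox jumps over the lazy dog", remove_punct = TRUE)
    ngrams <- unlist(as.list(tokens_ngrams(toks, n = 3, concatenator = "_")))
    edges_ok <- vapply(strsplit(ngrams, "_", fixed = TRUE), function(parts) {
      !(parts[1] %in% stopwords("en")) && !(parts[length(parts)] %in% stopwords("en"))
    }, logical(1))
    ngrams[edges_ok]  # e.g. "quick_brown_fox" kept, "over_the_lazy" dropped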

R: removeCommonTerms with Quanteda package?

Submitted by 蓝咒 on 2019-12-07 15:09:38
The removeCommonTerms function for the tm package is found here:

    removeCommonTerms <- function(x, pct) {
      stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
                is.numeric(pct), pct > 0, pct < 1)
      m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x
      t <- table(m$i) < m$ncol * (pct)
      termIndex <- as.numeric(names(t[t]))
      if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ]
    }

Now I would like to remove too-common terms with the quanteda package. I could do this removal before creating the document-feature matrix or on the document-feature matrix itself.
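A minimal sketch of the quanteda equivalent (my suggestion; the argument names are from dfm_trim() in quanteda >= 1.4): trim by maximum document frequency expressed as a proportion, mirroring the pct threshold above.

    library(quanteda)
    dfmat <- dfm(tokens(c("a b c", "a b", "a")))
    # keep only features occurring in at most 80% of documents ("a" is dropped)
    dfm_trim(dfmat, max_docfreq = 0.8, docfreq_type = "prop")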

Logical combinations in quanteda dictionaries

Submitted by 那年仲夏 on 2019-12-07 12:55:12
Question: I'm using the quanteda dictionary lookup. I'd like to formulate entries that let me look up logical combinations of words, for example: Teddybear = (fluffy AND adorable AND soft). Is this possible? So far I have only found a way to test for phrases like Teddybear = (soft fluffy adorable), but then it has to be an exact phrase match in the text. How can I get matches that ignore the order of the words?

Answer 1: This is not currently something that is directly possible in quanteda (v1.2.0).
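A minimal sketch of one workaround (my own, not part of the answer above): treat the AND-combination as "every term present in the document" and test it on the dfm, which is order-independent by nature.

    library(quanteda)
    dfmat <- dfm(tokens(c("a fluffy, soft and adorable bear", "a soft pillow")))
    terms <- c("fluffy", "adorable", "soft")
    # TRUE for documents containing all three terms, in any order
    rowSums(dfm_select(dfmat, terms) > 0) == length(terms)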

QUANTEDA - invalid class “dfmSparse” object

Submitted by 别等时光非礼了梦想. on 2019-12-06 14:18:44
I get this error message. I use these data: https://github.com/kbenoit/quanteda/tree/master/data/data_char_inaugural.RData

RStudio version: 1.0.136 (© 2009-2016 RStudio, Inc.)

    library(quanteda)
    uk2010immigCorpus <- corpus(data_char_ukimmig2010,
                                docvars = data.frame(party = names(data_char_ukimmig2010)),
                                metacorpus = list(notes = "Immigration-related sections of 2010 UK party manifestos"))
    mydfm <- dfm(uk2010immigCorpus, remove = c("will", stopwords("english")), removePunct = TRUE)

    Error in validObject(.Object) : invalid class “dfmSparse” object:
      superclass "replValueSp" not defined in …
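In my experience, a "superclass ... not defined" S4 error usually means quanteda and the Matrix package it was built against have drifted out of sync (a hedged diagnosis, not confirmed by this thread); reinstalling both and restarting R typically clears it.

    # reinstall the two packages so their S4 class definitions match again
    install.packages("Matrix")
    install.packages("quanteda")
    # restart R, then library(quanteda) and rebuild the dfm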

How to keep the beginning and end of sentence markers with quanteda

Submitted by a 夏天 on 2019-12-06 12:08:25
Question: I'm trying to create 3-grams using R's quanteda package. I'm struggling to find a way to keep the beginning- and end-of-sentence markers, the <s> and </s> as in the code below, in the n-grams. I thought that using keptFeatures with a regular expression matching them would preserve them, but the chevron markers are always removed. How can I keep the chevron markers from being removed, or what is the best way to delimit the beginning and end of sentences with quanteda? As a bonus question, what is …
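A minimal sketch of one way to preserve the markers (my own approach; quanteda versions differ in their tokenizer options): tokenise on whitespace only, so <s> and </s> survive as ordinary tokens, then build the 3-grams.

    library(quanteda)
    txt <- "<s> the cat sat on the mat </s>"
    toks <- tokens(txt, what = "fastestword")  # whitespace-only split keeps <s> and </s>
    tokens_ngrams(toks, n = 3)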

Quanteda package, Naive Bayes: How can I predict on different-featured test data?

Submitted by 点点圈 on 2019-12-06 09:21:06
Question: I used quanteda::textmodel_NB to create a model that categorizes text into one of two categories. I fit the model on a training data set from last summer. Now I am trying to use it this summer to categorize new text we get here at work. I tried doing this and got the following error:

    Error in predict.textmodel_NB_fitted(model, test_dfm) :
      feature set in newdata different from that in training set

The code in the function that generates the error can be found here, at lines 157 to 165.
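A minimal sketch of the usual fix (my suggestion; dfm_match() exists in quanteda >= 1.3, and in current quanteda the Naive Bayes model lives in quanteda.textmodels as textmodel_nb): conform the test dfm to the training features, padding unseen training features with zeros and dropping features the model never saw.

    library(quanteda)
    library(quanteda.textmodels)
    train_dfm <- dfm(tokens(c("good great fun", "bad awful boring")))
    model <- textmodel_nb(train_dfm, y = c("pos", "neg"))
    test_dfm <- dfm(tokens("great fun new words"))
    # align the test features with the training features before predicting
    test_dfm <- dfm_match(test_dfm, features = featnames(train_dfm))
    predict(model, newdata = test_dfm)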