quanteda

R: initialise empty dgCMatrix given by matrix multiplication of two Quanteda DFM sparse matrices?

Submitted by 隐身守侯 on 2019-12-10 11:57:03
Question: I have a for loop like this, trying to implement the solution here with dummy variables:

    aaa <- DFM %*% t(DFM)  # DFM is a quanteda dfm sparse matrix
    for (i in 1:nrow(aaa)) aaa[i, ] <- aaa[i, ][order(aaa[i, ], decreasing = TRUE)]

but now

    for (i in 1:nrow(mmm)) mmm[i, ] <- aaa[i, ][order(aaa[i, ], decreasing = TRUE)]

where mmm does not exist yet; the goal is to do the same thing as mmm <- t(apply(aaa, 1, sort, decreasing = TRUE)). But before the for loop I need to initialise mmm, otherwise: Error: …
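A minimal sketch of two ways to initialise mmm (my own suggestion, not taken from the question): copy aaa, which gives an independent object of the right class and dimensions thanks to R's copy-on-modify, or build an all-zero dgCMatrix with Matrix::sparseMatrix(). Note that row-sorted output is typically dense, so a plain matrix is often the better container.

    library(Matrix)
    # Option 1: start from a copy; same class, dimensions and dimnames as aaa
    mmm <- aaa
    # Option 2: an empty (all-zero) dgCMatrix of matching shape
    mmm <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0), dims = dim(aaa))
    # Fill row by row (slow for large dgCMatrix objects; assigning into a dense
    # matrix(0, nrow(aaa), ncol(aaa)) is usually much faster)
    for (i in 1:nrow(aaa)) mmm[i, ] <- sort(aaa[i, ], decreasing = TRUE)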

R: sparse matrix multiplication with data.table and quanteda package?

Submitted by 给你一囗甜甜゛ on 2019-12-10 00:52:24
Question: I am trying to do matrix multiplication with a sparse matrix from the quanteda package, utilising the data.table package, related to this thread here. So:

    require(quanteda)
    mytext <- c("Let the big dogs hunt", "No holds barred", "My child is an honor student")
    myMatrix <- dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
    # a data.table
    as.matrix(myMatrix) %*% transpose(as.matrix(myMatrix))

How can you get the matrix multiplication working here with the quanteda package and …
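A minimal sketch of the direct route (my suggestion, not from the thread): a dfm already is a sparse Matrix object, so %*% and t() dispatch to the Matrix package; there is no need to densify with as.matrix(), and data.table's transpose() is meant for lists, not matrices.

    library(quanteda)
    mytext <- c("Let the big dogs hunt", "No holds barred", "My child is an honor student")
    # Recent quanteda versions replaced dfm(..., ignoredFeatures =, stem =) with a
    # tokens() pipeline; adjust to your installed version.
    dfmat <- dfm_wordstem(dfm_remove(dfm(tokens(mytext)), stopwords("english")))
    dfmat %*% t(dfmat)  # document-by-document product, stays sparse throughout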

A lemmatizing function using a hash dictionary does not work with tm package in R

Submitted by 佐手、 on 2019-12-09 23:45:13
Question: I would like to lemmatize Polish text using a large external dictionary (in the format of the txt variable below). Unfortunately, Polish is not an available option in the popular text-mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts (I have also removed Polish diacritics from both the dictionary and the corpus). Unfortunately, the function does not work with the corpus format generated by tm.
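A minimal sketch of how I would bridge that gap (lemmatize_text() is a hypothetical stand-in for the vector-in/vector-out function from the linked answer): tm can apply any such string function document-by-document via content_transformer().

    library(tm)
    # hypothetical stand-in for the hash-dictionary lemmatizer from the linked answer
    lemmatize_text <- function(x) x
    corp <- VCorpus(VectorSource(c("ala ma kota", "kot ma ale")))
    corp <- tm_map(corp, content_transformer(lemmatize_text))
    content(corp[[1]])  # inspect the lemmatized first document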

Remove ngrams with leading and trailing stopwords

Submitted by ↘锁芯ラ on 2019-12-07 16:29:32
Question: I want to identify major n-grams in a bunch of academic papers, including n-grams with nested stopwords but not n-grams with leading or trailing stopwords. I have about 100 PDF files. I converted them to plain-text files through an Adobe batch command and collected them in a single directory. From there I use R. (It's a patchwork of code because I'm just getting started with text mining.) My code:

    library(tm)
    # Make path for sub-dir which contains corpus files
    path <- file.path(getwd(), …
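A minimal sketch of one filtering approach (my own, not the asker's patchwork): build the n-grams with quanteda, then drop any whose first or last token is a stopword; n-grams with only nested stopwords pass by construction.

    library(quanteda)
    toks <- tokens("the quick brown fox jumps over the lazy dog", remove_punct = TRUE)
    ngrams <- unlist(as.list(tokens_ngrams(toks, n = 3, concatenator = "_")))
    edges_ok <- vapply(strsplit(ngrams, "_", fixed = TRUE), function(parts) {
      !(parts[1] %in% stopwords("en")) && !(parts[length(parts)] %in% stopwords("en"))
    }, logical(1))
    ngrams[edges_ok]  # e.g. "quick_brown_fox" kept, "over_the_lazy" dropped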

R: removeCommonTerms with Quanteda package?

Submitted by 蓝咒 on 2019-12-07 15:09:38
The removeCommonTerms function for the tm package is found here:

    removeCommonTerms <- function(x, pct) {
      stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
                is.numeric(pct), pct > 0, pct < 1)
      m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x
      t <- table(m$i) < m$ncol * (pct)
      termIndex <- as.numeric(names(t[t]))
      if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ]
    }

Now I would like to remove too-common terms with the quanteda package. I could do this removal before creating the document-feature matrix or on the document-feature matrix itself.
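A minimal sketch of the quanteda equivalent (my suggestion; the argument names are from dfm_trim() in quanteda >= 1.4): trim by maximum document frequency expressed as a proportion, mirroring the pct threshold above.

    library(quanteda)
    dfmat <- dfm(tokens(c("a b c", "a b", "a")))
    # keep only features occurring in at most 80% of documents ("a" is dropped)
    dfm_trim(dfmat, max_docfreq = 0.8, docfreq_type = "prop")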

Logical combinations in quanteda dictionaries

Submitted by 那年仲夏 on 2019-12-07 12:55:12
Question: I'm using the quanteda dictionary lookup. I'd like to formulate entries that let me look up logical combinations of words, for example: Teddybear = (fluffy AND adorable AND soft). Is this possible? So far I have only found a way to test for phrases like Teddybear = (soft fluffy adorable), but then it has to be an exact phrase match in the text. How can I get matches that ignore the order of the words?

Answer 1: This is not currently something that is directly possible in quanteda (v1.2.0).
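A minimal sketch of one workaround (my own, not part of the answer above): treat the AND-combination as "every term present in the document" and test it on the dfm, which is order-independent by nature.

    library(quanteda)
    dfmat <- dfm(tokens(c("a fluffy, soft and adorable bear", "a soft pillow")))
    terms <- c("fluffy", "adorable", "soft")
    # TRUE for documents containing all three terms, in any order
    rowSums(dfm_select(dfmat, terms) > 0) == length(terms)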

QUANTEDA - invalid class “dfmSparse” object

Submitted by 别等时光非礼了梦想. on 2019-12-06 14:18:44
I get this error message. I use these data: https://github.com/kbenoit/quanteda/tree/master/data/data_char_inaugural.RData

RStudio version: 1.0.136 (© 2009-2016 RStudio, Inc.)

    library(quanteda)
    uk2010immigCorpus <- corpus(data_char_ukimmig2010,
                                docvars = data.frame(party = names(data_char_ukimmig2010)),
                                metacorpus = list(notes = "Immigration-related sections of 2010 UK party manifestos"))
    mydfm <- dfm(uk2010immigCorpus, remove = c("will", stopwords("english")), removePunct = TRUE)

    Error in validObject(.Object) : invalid class “dfmSparse” object:
      superclass "replValueSp" not defined in …
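In my experience, a "superclass ... not defined" S4 error usually means quanteda and the Matrix package it was built against have drifted out of sync (a hedged diagnosis, not confirmed by this thread); reinstalling both and restarting R typically clears it.

    # reinstall the two packages so their S4 class definitions match again
    install.packages("Matrix")
    install.packages("quanteda")
    # restart R, then library(quanteda) and rebuild the dfm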

How to keep the beginning and end of sentence markers with quanteda

Submitted by a 夏天 on 2019-12-06 12:08:25
Question: I'm trying to create 3-grams using R's quanteda package. I'm struggling to find a way to keep the beginning- and end-of-sentence markers, the <s> and </s> as in the code below, in the n-grams. I thought that using keptFeatures with a regular expression matching them would preserve them, but the chevron markers are always removed. How can I keep the chevron markers from being removed, or what is the best way to delimit the beginning and end of sentences with quanteda? As a bonus question, what is …
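A minimal sketch of one way to preserve the markers (my own approach; quanteda versions differ in their tokenizer options): tokenise on whitespace only, so <s> and </s> survive as ordinary tokens, then build the 3-grams.

    library(quanteda)
    txt <- "<s> the cat sat on the mat </s>"
    toks <- tokens(txt, what = "fastestword")  # whitespace-only split keeps <s> and </s>
    tokens_ngrams(toks, n = 3)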

Quanteda package, Naive Bayes: How can I predict on different-featured test data?

Submitted by 点点圈 on 2019-12-06 09:21:06
Question: I used quanteda::textmodel_NB to create a model that categorizes text into one of two categories. I fit the model on a training data set from last summer. Now I am trying to use it this summer to categorize new text we get here at work. I tried doing this and got the following error:

    Error in predict.textmodel_NB_fitted(model, test_dfm) :
      feature set in newdata different from that in training set

The code in the function that generates the error can be found here, at lines 157 to 165.
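A minimal sketch of the usual fix (my suggestion; dfm_match() exists in quanteda >= 1.3, and in current quanteda the Naive Bayes model lives in quanteda.textmodels as textmodel_nb): conform the test dfm to the training features, padding unseen training features with zeros and dropping features the model never saw.

    library(quanteda)
    library(quanteda.textmodels)
    train_dfm <- dfm(tokens(c("good great fun", "bad awful boring")))
    model <- textmodel_nb(train_dfm, y = c("pos", "neg"))
    test_dfm <- dfm(tokens("great fun new words"))
    # align the test features with the training features before predicting
    test_dfm <- dfm_match(test_dfm, features = featnames(train_dfm))
    predict(model, newdata = test_dfm)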