quanteda

R: problems applying LIME to quanteda text model

时光毁灭记忆、已成空白 submitted on 2020-01-14 02:43:40

Question: This is a modified version of my previous question: I'm trying to run LIME on my quanteda text model that feeds off the Trump & Clinton tweets data. I ran it following an example given by Thomas Pedersen in his Understanding LIME and a useful SO answer provided by @Weihuang Wong: library(dplyr) library(stringr) library(quanteda) library(lime) #data prep tweet_csv <- read_csv("tweets.csv") # creating corpus and dfm for train and test sets get_matrix <- function(df){ corpus <- quanteda::corpus(df) dfm
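For context, the usual fix in this situation is to give lime S3 methods for the model object, so that lime knows the task type and how to obtain class probabilities. A minimal base-R sketch follows; the class name `textmodel_wrapped`, the `predict_fun` field, and `wrap_model()` are illustrative assumptions, not part of the quanteda or lime APIs:

```r
# lime dispatches on two S3 methods for a custom model class:
# model_type() tells it the task type, and predict_model() must return
# a data.frame with one probability column per class.
model_type.textmodel_wrapped <- function(x, ...) {
  "classification"
}

predict_model.textmodel_wrapped <- function(x, newdata, type, ...) {
  # x$predict_fun is assumed to return one probability column per class
  res <- x$predict_fun(newdata)
  as.data.frame(res, check.names = FALSE)
}

# wrap any fitted model plus its prediction function in the class
wrap_model <- function(model, predict_fun) {
  structure(list(model = model, predict_fun = predict_fun),
            class = "textmodel_wrapped")
}
```

With such a wrapper, `lime::lime(train_text, wrapped_model, ...)` can treat the quanteda pipeline like any other supported model.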

Lemmatization using txt file with lemmes in R

↘锁芯ラ submitted on 2020-01-13 06:42:25

Question: I would like to use an external txt file with Polish lemmas structured as follows (a source of lemma lists for many other languages: http://www.lexiconista.com/datasets/lemmatization/): Abadan Abadanem Abadan Abadanie Abadan Abadanowi Abadan Abadanu abadańczyk abadańczycy abadańczyk abadańczyka abadańczyk abadańczykach abadańczyk abadańczykami abadańczyk abadańczyki abadańczyk abadańczykiem abadańczyk abadańczykom abadańczyk abadańczyków abadańczyk abadańczykowi abadańczyk abadańczyku abadanka abadance
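Those lexiconista files are two whitespace-separated columns, lemma then inflected form, so lemmatization reduces to a lookup table (readable with `read.delim(file, header = FALSE)`). A minimal base-R sketch; the `lemmatize()` helper is an assumption for illustration, while in quanteda itself `tokens_replace(toks, pattern = form, replacement = lemma)` should perform the same per-token substitution:

```r
# Replace each token by its lemma if the token appears among the
# inflected forms; tokens with no dictionary entry are kept unchanged.
lemmatize <- function(tokens, lemma, form) {
  i <- match(tokens, form)            # position of each token in the form list
  ifelse(is.na(i), tokens, lemma[i])  # substitute lemma where a match exists
}
```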

Computing cosine similarities on a large corpus in R using quanteda

断了今生、忘了曾经 submitted on 2020-01-07 03:04:47

Question: I am trying to work with a very large corpus of about 85,000 tweets that I'm trying to compare to dialog from television commercials. However, due to the size of my corpus, I am unable to process the cosine similarity measure without getting the "Error: cannot allocate vector of size n" message (26 GB in my case). I am already running 64-bit R on a server with lots of memory. I've also tried the AWS server with the most memory (244 GB), but to no avail (same error). Is there a
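One way to avoid the allocation error is to keep the matrix sparse and never densify it. A sketch using the Matrix package (ships with R); the coercion `as(dfm, "CsparseMatrix")` for a quanteda dfm is an assumption about recent package versions:

```r
library(Matrix)  # sparse matrix algebra, avoids dense intermediates

# Cosine similarity between the rows of a (sparse) count matrix:
# L2-normalize each row, then take normalized dot products.
cosine_rows <- function(m) {
  m <- Matrix(m, sparse = TRUE)
  m <- m / sqrt(rowSums(m^2))   # row-wise L2 normalization
  tcrossprod(m)                 # [i, j] = cosine(row i, row j)
}
```

With 85,000 documents the full 85,000 x 85,000 result is itself the memory problem, so in practice compute it in row blocks, e.g. `tcrossprod(m_norm[idx, , drop = FALSE], m_norm)`, keeping only the top matches per block before moving on.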

Concatenate dfm matrices in 'quanteda' package

∥☆過路亽.° submitted on 2020-01-06 03:06:51

Question: Is there a method to concatenate two dfm matrices containing different numbers of columns and rows at the same time? It can be done with some additional coding, so I am not interested in an ad hoc solution but in a general and elegant one, if it exists. An example: dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE) dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE) rbind(dfm1, dfm2) gives an error. The 'tm' package can
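The general recipe is to pad both matrices to the union of their feature sets before stacking (newer quanteda versions may handle this inside `rbind()` directly, but that is version-dependent). A base-R sketch for any docs-by-features count matrix; `rbind_union()` is an illustrative helper, not a quanteda function:

```r
# Stack two docs-by-features matrices with different feature sets:
# pad each to the union of the column names, filling absent features with 0.
rbind_union <- function(a, b) {
  feats <- union(colnames(a), colnames(b))
  pad <- function(m) {
    out <- matrix(0, nrow(m), length(feats),
                  dimnames = list(rownames(m), feats))
    out[, colnames(m)] <- m
    out
  }
  rbind(pad(a), pad(b))
}
```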

quanteda kwic regex operation

六月ゝ 毕业季﹏ submitted on 2020-01-05 06:47:59

Question: Further edit to the original question. The question originated from my expectation that regexes would work identically, or nearly so, to grep or to some programming language. This below is what I expected, and the fact that it did not happen generated my question (using cygwin): echo "regex unusual operation will deport into a different" > out.txt grep "will * dep" out.txt "regex unusual operation will deport into a different" Original question: Trying to follow https://github.com/kbenoit/ITAUR/blob/master
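The mismatch comes from what the pattern is applied to: grep matches the regex against the whole line, so in "will * dep" the `*` quantifies the space before it and the pattern can span the gap between words, whereas quanteda's `kwic()` by default matches each pattern element against a single token (multi-token queries need `phrase()`). A base-R demonstration of the grep-side semantics:

```r
x <- "regex unusual operation will deport into a different"

# "will * dep" = "will", then zero or more spaces (the * quantifies the
# preceding space), then " dep" -- so it crosses token boundaries:
grepl("will * dep", x)

# In quanteda, kwic patterns are token-wise; the analogous query would be
# something like kwic(tokens(x), phrase("will dep*")) with glob matching.
```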

Naive Bayes in Quanteda vs caret: wildly different results

眉间皱痕 submitted on 2020-01-01 12:23:31

Question: I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the built-in Naive Bayes classifier of quanteda with the ones in caret. However, I can't seem to get caret to work right. Here is some code for reproduction. First, on the quanteda side: library(quanteda) library(quanteda.corpora) library(caret) corp <- data_corpus_movies set.seed(300) id_train <- sample(docnames(corp), size = 1500, replace = FALSE) # get
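A common source of the "wildly different results" is the model family: quanteda's `textmodel_nb()` fits a multinomial Naive Bayes over word counts, while caret's `"nb"` method (via klaR) treats each count column as a continuous feature. A minimal multinomial Naive Bayes in base R, sketching what the count-based model computes (toy helper names, not the quanteda implementation):

```r
# Multinomial Naive Bayes on a docs-by-features count matrix.
train_mnb <- function(x, y, alpha = 1) {
  classes <- sort(unique(y))
  logprior <- log(sapply(classes, function(k) mean(y == k)))
  loglik <- t(sapply(classes, function(k) {
    counts <- colSums(x[y == k, , drop = FALSE]) + alpha  # Laplace smoothing
    log(counts / sum(counts))                             # P(word | class)
  }))
  list(classes = classes, logprior = logprior, loglik = loglik)
}

predict_mnb <- function(model, x) {
  scores <- x %*% t(model$loglik)                  # sum of count * log P(word|class)
  scores <- sweep(scores, 2, model$logprior, "+")  # add log prior
  model$classes[max.col(scores)]                   # highest-scoring class per doc
}
```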

Pairwise Distance between documents

◇◆丶佛笑我妖孽 submitted on 2019-12-24 19:15:56

Question: I am trying to calculate the similarity of rows of one document-term matrix with rows of another document-term matrix. A <- data.frame(name = c( "X-ray right leg arteries", "x-ray left shoulder", "x-ray leg arteries", "x-ray leg with 20km distance" ), stringsAsFactors = F) B <- data.frame(name = c( "X-ray left leg arteries", "X-ray leg", "xray right leg", "X-ray right leg arteries" ), stringsAsFactors = F) corp1 <- corpus(A, text_field = "name") corp2 <- corpus(B, text_field = "name") docnames
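The key step is aligning the two matrices on a shared feature set before comparing rows (in quanteda, `textstat_simil(dfm1, dfm2, method = "cosine")` with matched features is the likely route, subject to package version). A base-R sketch of the cross-document cosine computation; `cross_cosine()` is an illustrative helper:

```r
# Cosine similarity between each row of A and each row of B,
# after padding both matrices to the union of their features.
cross_cosine <- function(a, b) {
  feats <- union(colnames(a), colnames(b))
  pad <- function(m) {
    out <- matrix(0, nrow(m), length(feats),
                  dimnames = list(rownames(m), feats))
    out[, colnames(m)] <- m
    out
  }
  a <- pad(a); b <- pad(b)
  a <- a / sqrt(rowSums(a^2))   # L2-normalize rows of A
  b <- b / sqrt(rowSums(b^2))   # L2-normalize rows of B
  tcrossprod(a, b)              # [i, j] = cosine(row i of A, row j of B)
}
```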

Split up ngrams in (sparse) document-feature matrix

别等时光非礼了梦想. submitted on 2019-12-24 08:30:10

Question: This is a follow-up question to this one. There, I asked whether it's possible to split up ngram features in a document-feature matrix (dfm class from the quanteda package) in such a way that e.g. bigrams result in two separate unigrams. For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction"). library(quanteda) eg.txt <- c('increase in_the great
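Splitting can be done on the count matrix itself: duplicate each n-gram column once per constituent word, then sum columns that share a unigram name. A base-R sketch (the `split_ngrams()` helper is illustrative; a real dfm would first be converted to a matrix, e.g. via `as.matrix()`):

```r
# Split "_"-joined n-gram columns into unigram columns. Each unigram
# inherits the full count of its source n-gram (so a bigram count of 1
# contributes 1 to each of its two unigrams), then duplicates are summed.
split_ngrams <- function(m, sep = "_") {
  parts <- strsplit(colnames(m), sep, fixed = TRUE)
  src <- rep(seq_along(parts), lengths(parts))  # source column of each unigram
  expanded <- m[, src, drop = FALSE]
  colnames(expanded) <- unlist(parts)
  # collapse duplicate unigram columns by summing
  t(rowsum(t(expanded), group = colnames(expanded)))
}
```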

Interpretation of dfm_weight(scheme='prop') with groups (quanteda)

荒凉一梦 submitted on 2019-12-24 03:37:12

Question: I'm looking at the different weighting options using dfm_weight. If I select scheme = 'prop' and I group textstat_frequency by location, what's the proper interpretation of a word's value in each group? Say in New York the term career is 0.6 and in Boston the word team is 4.0; how can I interpret these numbers? corp=corpus(df,text_field = "What are the areas that need the most improvement at our company?") %>% dfm(remove_numbers=T,remove_punct=T,remove=c(toRemove,stopwords('english')),ngrams
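The interpretation hinges on what `scheme = 'prop'` produces: each count is divided by its document's total, so values are within-document proportions that sum to 1 per document. When frequencies are then summed across a group's documents, a value like 4.0 is a sum of per-document proportions over that group, not itself a proportion, which is why it can exceed 1. A base-R demonstration with toy numbers (not the asker's data):

```r
m <- rbind(doc1 = c(career = 2, team = 2),
           doc2 = c(career = 1, team = 3))

# scheme = "prop": divide each row by its document total; rows sum to 1
prop <- m / rowSums(m)

# grouping both docs into one location sums per-document proportions,
# so grouped values can exceed 1 and are comparable only within the group
grouped <- colSums(prop)
```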