quanteda

R: problems applying LIME to quanteda text model

时光毁灭记忆、已成空白 submitted on 2020-01-14 02:43:40

Question: This is a modified version of my previous question: I'm trying to run LIME on my quanteda text model that feeds off the Trump & Clinton tweets data. I ran it following an example given by Thomas Pedersen in his Understanding LIME and a useful SO answer provided by @Weihuang Wong: library(dplyr) library(stringr) library(quanteda) library(lime) #data prep tweet_csv <- read_csv("tweets.csv") # creating corpus and dfm for train and test sets get_matrix <- function(df){ corpus <- quanteda::corpus(df) dfm
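For context, the usual fix in this situation is to give lime S3 methods for the model object, so that lime knows the task type and how to obtain class probabilities. A minimal base-R sketch follows; the class name `textmodel_wrapped`, the `predict_fun` field, and `wrap_model()` are illustrative assumptions, not part of the quanteda or lime APIs:

```r
# lime dispatches on two S3 methods for a custom model class:
# model_type() tells it the task type, and predict_model() must return
# a data.frame with one probability column per class.
model_type.textmodel_wrapped <- function(x, ...) {
  "classification"
}

predict_model.textmodel_wrapped <- function(x, newdata, type, ...) {
  # x$predict_fun is assumed to return one probability column per class
  res <- x$predict_fun(newdata)
  as.data.frame(res, check.names = FALSE)
}

# wrap any fitted model plus its prediction function in the class
wrap_model <- function(model, predict_fun) {
  structure(list(model = model, predict_fun = predict_fun),
            class = "textmodel_wrapped")
}
```

With such a wrapper, `lime::lime(train_text, wrapped_model, ...)` can treat the quanteda pipeline like any other supported model.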

Lemmatization using txt file with lemmes in R

↘锁芯ラ submitted on 2020-01-13 06:42:25

Question: I would like to use an external txt file with Polish lemmas structured as follows (a source of lemma lists for many other languages: http://www.lexiconista.com/datasets/lemmatization/): Abadan Abadanem Abadan Abadanie Abadan Abadanowi Abadan Abadanu abadańczyk abadańczycy abadańczyk abadańczyka abadańczyk abadańczykach abadańczyk abadańczykami abadańczyk abadańczyki abadańczyk abadańczykiem abadańczyk abadańczykom abadańczyk abadańczyków abadańczyk abadańczykowi abadańczyk abadańczyku abadanka abadance
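Those lexiconista files are two whitespace-separated columns, lemma then inflected form, so lemmatization reduces to a lookup table (readable with `read.delim(file, header = FALSE)`). A minimal base-R sketch; the `lemmatize()` helper is an assumption for illustration, while in quanteda itself `tokens_replace(toks, pattern = form, replacement = lemma)` should perform the same per-token substitution:

```r
# Replace each token by its lemma if the token appears among the
# inflected forms; tokens with no dictionary entry are kept unchanged.
lemmatize <- function(tokens, lemma, form) {
  i <- match(tokens, form)            # position of each token in the form list
  ifelse(is.na(i), tokens, lemma[i])  # substitute lemma where a match exists
}
```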

Computing cosine similarities on a large corpus in R using quanteda

断了今生、忘了曾经 submitted on 2020-01-07 03:04:47

Question: I am trying to work with a very large corpus of about 85,000 tweets that I'm trying to compare to dialog from television commercials. However, due to the size of my corpus, I am unable to process the cosine similarity measure without getting the "Error: cannot allocate vector of size n" message (26 GB in my case). I am already running 64-bit R on a server with lots of memory. I've also tried the AWS server with the most memory (244 GB), but to no avail (same error). Is there a
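One way to avoid the allocation error is to keep the matrix sparse and never densify it. A sketch using the Matrix package (ships with R); the coercion `as(dfm, "CsparseMatrix")` for a quanteda dfm is an assumption about recent package versions:

```r
library(Matrix)  # sparse matrix algebra, avoids dense intermediates

# Cosine similarity between the rows of a (sparse) count matrix:
# L2-normalize each row, then take normalized dot products.
cosine_rows <- function(m) {
  m <- Matrix(m, sparse = TRUE)
  m <- m / sqrt(rowSums(m^2))   # row-wise L2 normalization
  tcrossprod(m)                 # [i, j] = cosine(row i, row j)
}
```

With 85,000 documents the full 85,000 x 85,000 result is itself the memory problem, so in practice compute it in row blocks, e.g. `tcrossprod(m_norm[idx, , drop = FALSE], m_norm)`, keeping only the top matches per block before moving on.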

Concatenate dfm matrices in 'quanteda' package

∥☆過路亽.° submitted on 2020-01-06 03:06:51

Question: Is there a method to concatenate two dfm matrices containing different numbers of columns and rows at the same time? It can be done with some additional coding, so I am not interested in an ad hoc solution but in a general and elegant one, if it exists. An example: dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE) dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE) rbind(dfm1, dfm2) gives an error. The 'tm' package can
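The general recipe is to pad both matrices to the union of their feature sets before stacking (newer quanteda versions may handle this inside `rbind()` directly, but that is version-dependent). A base-R sketch for any docs-by-features count matrix; `rbind_union()` is an illustrative helper, not a quanteda function:

```r
# Stack two docs-by-features matrices with different feature sets:
# pad each to the union of the column names, filling absent features with 0.
rbind_union <- function(a, b) {
  feats <- union(colnames(a), colnames(b))
  pad <- function(m) {
    out <- matrix(0, nrow(m), length(feats),
                  dimnames = list(rownames(m), feats))
    out[, colnames(m)] <- m
    out
  }
  rbind(pad(a), pad(b))
}
```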

quanteda kwic regex operation

六月ゝ 毕业季﹏ submitted on 2020-01-05 06:47:59

Question: Further edit to the original question. The question originated from my expectation that regexes would work identically, or nearly so, to grep or to some programming language. This below is what I expected, and the fact that it did not happen generated my question (using cygwin): echo "regex unusual operation will deport into a different" > out.txt grep "will * dep" out.txt "regex unusual operation will deport into a different" Original question: Trying to follow https://github.com/kbenoit/ITAUR/blob/master
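The mismatch comes from what the pattern is applied to: grep matches the regex against the whole line, so in "will * dep" the `*` quantifies the space before it and the pattern can span the gap between words, whereas quanteda's `kwic()` by default matches each pattern element against a single token (multi-token queries need `phrase()`). A base-R demonstration of the grep-side semantics:

```r
x <- "regex unusual operation will deport into a different"

# "will * dep" = "will", then zero or more spaces (the * quantifies the
# preceding space), then " dep" -- so it crosses token boundaries:
grepl("will * dep", x)

# In quanteda, kwic patterns are token-wise; the analogous query would be
# something like kwic(tokens(x), phrase("will dep*")) with glob matching.
```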

Naive Bayes in Quanteda vs caret: wildly different results

眉间皱痕 submitted on 2020-01-01 12:23:31

Question: I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the built-in Naive Bayes classifier of quanteda with the ones in caret. However, I can't seem to get caret to work right. Here is some code for reproduction. First, on the quanteda side: library(quanteda) library(quanteda.corpora) library(caret) corp <- data_corpus_movies set.seed(300) id_train <- sample(docnames(corp), size = 1500, replace = FALSE) # get
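A common source of the "wildly different results" is the model family: quanteda's `textmodel_nb()` fits a multinomial Naive Bayes over word counts, while caret's `"nb"` method (via klaR) treats each count column as a continuous feature. A minimal multinomial Naive Bayes in base R, sketching what the count-based model computes (toy helper names, not the quanteda implementation):

```r
# Multinomial Naive Bayes on a docs-by-features count matrix.
train_mnb <- function(x, y, alpha = 1) {
  classes <- sort(unique(y))
  logprior <- log(sapply(classes, function(k) mean(y == k)))
  loglik <- t(sapply(classes, function(k) {
    counts <- colSums(x[y == k, , drop = FALSE]) + alpha  # Laplace smoothing
    log(counts / sum(counts))                             # P(word | class)
  }))
  list(classes = classes, logprior = logprior, loglik = loglik)
}

predict_mnb <- function(model, x) {
  scores <- x %*% t(model$loglik)                  # sum of count * log P(word|class)
  scores <- sweep(scores, 2, model$logprior, "+")  # add log prior
  model$classes[max.col(scores)]                   # highest-scoring class per doc
}
```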

Pairwise Distance between documents

◇◆丶佛笑我妖孽 submitted on 2019-12-24 19:15:56

Question: I am trying to calculate the similarity of rows of one document-term matrix with rows of another document-term matrix. A <- data.frame(name = c( "X-ray right leg arteries", "x-ray left shoulder", "x-ray leg arteries", "x-ray leg with 20km distance" ), stringsAsFactors = F) B <- data.frame(name = c( "X-ray left leg arteries", "X-ray leg", "xray right leg", "X-ray right leg arteries" ), stringsAsFactors = F) corp1 <- corpus(A, text_field = "name") corp2 <- corpus(B, text_field = "name") docnames
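The key step is aligning the two matrices on a shared feature set before comparing rows (in quanteda, `textstat_simil(dfm1, dfm2, method = "cosine")` with matched features is the likely route, subject to package version). A base-R sketch of the cross-document cosine computation; `cross_cosine()` is an illustrative helper:

```r
# Cosine similarity between each row of A and each row of B,
# after padding both matrices to the union of their features.
cross_cosine <- function(a, b) {
  feats <- union(colnames(a), colnames(b))
  pad <- function(m) {
    out <- matrix(0, nrow(m), length(feats),
                  dimnames = list(rownames(m), feats))
    out[, colnames(m)] <- m
    out
  }
  a <- pad(a); b <- pad(b)
  a <- a / sqrt(rowSums(a^2))   # L2-normalize rows of A
  b <- b / sqrt(rowSums(b^2))   # L2-normalize rows of B
  tcrossprod(a, b)              # [i, j] = cosine(row i of A, row j of B)
}
```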

Split up ngrams in (sparse) document-feature matrix

别等时光非礼了梦想. submitted on 2019-12-24 08:30:10

Question: This is a follow-up question to this one. There, I asked whether it's possible to split up ngram features in a document-feature matrix (dfm class from the quanteda package) in such a way that e.g. bigrams result in two separate unigrams. For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction"). library(quanteda) eg.txt <- c('increase in_the great
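Splitting can be done on the count matrix itself: duplicate each n-gram column once per constituent word, then sum columns that share a unigram name. A base-R sketch (the `split_ngrams()` helper is illustrative; a real dfm would first be converted to a matrix, e.g. via `as.matrix()`):

```r
# Split "_"-joined n-gram columns into unigram columns. Each unigram
# inherits the full count of its source n-gram (so a bigram count of 1
# contributes 1 to each of its two unigrams), then duplicates are summed.
split_ngrams <- function(m, sep = "_") {
  parts <- strsplit(colnames(m), sep, fixed = TRUE)
  src <- rep(seq_along(parts), lengths(parts))  # source column of each unigram
  expanded <- m[, src, drop = FALSE]
  colnames(expanded) <- unlist(parts)
  # collapse duplicate unigram columns by summing
  t(rowsum(t(expanded), group = colnames(expanded)))
}
```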

Interpretation of dfm_weight(scheme='prop') with groups (quanteda)

荒凉一梦 submitted on 2019-12-24 03:37:12

Question: I'm looking at the different weighting options using dfm_weight. If I select scheme = 'prop' and I group textstat_frequency by location, what's the proper interpretation of a word's value in each group? Say in New York the term career is 0.6 and in Boston the word team is 4.0; how can I interpret these numbers? corp=corpus(df,text_field = "What are the areas that need the most improvement at our company?") %>% dfm(remove_numbers=T,remove_punct=T,remove=c(toRemove,stopwords('english')),ngrams
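The interpretation hinges on what `scheme = 'prop'` produces: each count is divided by its document's total, so values are within-document proportions that sum to 1 per document. When frequencies are then summed across a group's documents, a value like 4.0 is a sum of per-document proportions over that group, not itself a proportion, which is why it can exceed 1. A base-R demonstration with toy numbers (not the asker's data):

```r
m <- rbind(doc1 = c(career = 2, team = 2),
           doc2 = c(career = 1, team = 3))

# scheme = "prop": divide each row by its document total; rows sum to 1
prop <- m / rowSums(m)

# grouping both docs into one location sums per-document proportions,
# so grouped values can exceed 1 and are comparable only within the group
grouped <- colSums(prop)
```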