quanteda

R: problems applying LIME to quanteda text model

ε祈祈猫儿з submitted on 2019-12-06 05:20:15
This is a modified version of my previous question: I'm trying to run LIME on my quanteda text model that feeds off Trump & Clinton tweets data. I run it following the example given by Thomas Pedersen in his Understanding LIME and the useful SO answer provided by @Weihuang Wong:

    library(dplyr)
    library(stringr)
    library(quanteda)
    library(lime)
    library(readr)  # provides read_csv(), used below but not loaded in the original

    # data prep
    tweet_csv <- read_csv("tweets.csv")

    # creating corpus and dfm for train and test sets
    get_matrix <- function(df){
      corpus <- quanteda::corpus(df)
      quanteda::dfm(corpus, remove_url = TRUE, remove_punct = TRUE,
                    remove = stopwords("english"))
    }
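A minimal sketch of how these pieces typically fit together in Pedersen's approach, assuming a trained classifier and train/test text vectors; trained_model, train_tweets, and test_tweets are illustrative names, not from the question:

    library(lime)
    # build an explainer from the raw training texts; `preprocess` tells lime
    # how to turn perturbed texts back into the dfm format the model expects
    explainer <- lime(train_tweets$text, model = trained_model,
                      preprocess = get_matrix)
    # explain a handful of test tweets
    explanation <- explain(test_tweets$text, explainer,
                           n_labels = 1, n_features = 5)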

Logical combinations in quanteda dictionaries

落爺英雄遲暮 submitted on 2019-12-06 00:11:59
I'm using quanteda's dictionary lookup. I'd like to formulate entries that let me look up logical combinations of words, for example: Teddybear = (fluffy AND adorable AND soft). Is this possible? So far I have only found a way to test for phrases like Teddybear = (soft fluffy adorable), but then it has to be an exact phrase match in the text. How can I get matches that ignore the order of the words? This is not currently something that is directly possible in quanteda (v1.2.0). However, there are workarounds in which you create dictionary sequences that are permutations of your desired …
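A sketch of that permutation workaround, assuming the goal is to match the three words in any order when they occur as an adjacent sequence (combinat::permn() generates the orderings; this still will not match the words when other tokens separate them):

    library(quanteda)
    library(combinat)
    # all six orderings of the three words, as space-separated phrases
    perms <- sapply(permn(c("fluffy", "adorable", "soft")),
                    paste, collapse = " ")
    dict <- dictionary(list(teddybear = perms))
    toks <- tokens("the bear was soft fluffy adorable and cuddly")
    tokens_lookup(toks, dictionary = dict)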

R Text Mining with quanteda

谁说胖子不能爱 submitted on 2019-12-04 22:41:32
I have a data set of Facebook posts (via Netvizz) and I use the quanteda package in R. Here is my R code:

    # Load the relevant dictionary (relevant for analysis)
    liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")

    # Read file: Facebook posts can be generated by FB Netvizz
    # https://apps.facebook.com/netvizz
    # Load FB posts as .csv file from .zip file
    fbpost <- read.csv("D:/FB-com.csv", sep = ";")

    # Define the relevant column(s): one column with 2700 entries
    # (the original read the data into fbpost but referenced FB_com here)
    fb_test <- as.character(fbpost$comment_message)

    # Define as corpus
    fb_corp <- corpus(fb_test)
    class(fb_corp)
    # LIWC …
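To continue where the excerpt breaks off, a plausible next step is counting LIWC categories over the corpus; a minimal sketch using quanteda's lookup function, assuming the objects defined above:

    # document-feature matrix from the corpus
    fb_dfm <- dfm(tokens(fb_corp))
    # map individual features onto their LIWC categories
    fb_liwc <- dfm_lookup(fb_dfm, dictionary = liwcdict)
    topfeatures(fb_liwc)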

R: sparse matrix multiplication with data.table and quanteda package?

跟風遠走 submitted on 2019-12-04 21:02:33
I am trying to do a matrix multiplication with a sparse matrix from the package called quanteda, utilising the data.table package, related to this thread here. So:

    require(quanteda)
    mytext <- c("Let the big dogs hunt", "No holds barred",
                "My child is an honor student")
    myMatrix <- dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
    # a data.table
    as.matrix(myMatrix) %*% transpose(as.matrix(myMatrix))

How can you get the matrix multiplication working here with the quanteda package and sparse matrices? This works just fine:

    mytext <- c("Let the big dogs hunt", "No holds barred", "My …
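For what it's worth, a dfm already is a sparse Matrix object, so the product can be computed without ever densifying; note also that data.table's transpose() works on lists, not matrices, which is why t() is used here. A sketch in newer quanteda syntax (ignoredFeatures/stem were the old dfm() arguments):

    library(quanteda)
    mytext <- c("Let the big dogs hunt", "No holds barred",
                "My child is an honor student")
    mydfm <- dfm_wordstem(dfm_remove(dfm(tokens(mytext)),
                                     stopwords("english")))
    # coerce to a plain sparse matrix and multiply; everything stays sparse
    m <- as(mydfm, "dgCMatrix")
    sim <- m %*% t(m)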

A lemmatizing function using a hash dictionary does not work with the tm package in R

孤街醉人 submitted on 2019-12-04 20:18:40
I would like to lemmatize Polish text using a large external dictionary (formatted like the txt variable below). I am not lucky enough to have a Polish option in the popular text mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.) The function works well with a vector of texts; unfortunately it does not work with the corpus format generated by tm. Let me paste Dmitriy's code:

    library(hashmap)
    library(data.table)
    txt = "Abadan Abadanem Abadan …
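One hedged guess at the missing piece: tm applies transformations through tm_map(), and a function that operates on character vectors usually needs wrapping in content_transformer() first. A sketch, assuming lemma_tokenizer() is Dmitriy's function from the linked answer and returns a vector of lemmatized tokens (polish_texts is an illustrative character vector):

    library(tm)
    corp <- VCorpus(VectorSource(polish_texts))
    # wrap the vector-level lemmatizer so tm can apply it document by document
    corp <- tm_map(corp, content_transformer(
      function(x) paste(lemma_tokenizer(x), collapse = " ")))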

Lemmatization using a txt file with lemmas in R

假装没事ソ submitted on 2019-12-04 19:47:56
I would like to use an external txt file with Polish lemmas, structured as lemma/inflected-form pairs like the following (source for lemmas for many other languages: http://www.lexiconista.com/datasets/lemmatization/):

    Abadan      Abadanem
    Abadan      Abadanie
    Abadan      Abadanowi
    Abadan      Abadanu
    abadańczyk  abadańczycy
    abadańczyk  abadańczyka
    abadańczyk  abadańczykach
    abadańczyk  abadańczykami
    abadańczyk  abadańczyki
    abadańczyk  abadańczykiem
    abadańczyk  abadańczykom
    abadańczyk  abadańczyków
    abadańczyk  abadańczykowi
    abadańczyk  abadańczyku
    abadanka    abadance
    abadanka    abadanek
    abadanka    abadanką
    abadanka    abadankach
    abadanka    abadankami

What packages and with …
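For the question that gets cut off, one way this is commonly done in quanteda is with tokens_replace(), reading the two-column file as lemma/form pairs. A sketch, assuming the file is tab-separated with the lemma in the first column (the file name is illustrative):

    library(quanteda)
    lemmas <- read.delim("lemmatization-pl.txt", header = FALSE,
                         col.names = c("lemma", "form"),
                         stringsAsFactors = FALSE, encoding = "UTF-8")
    toks <- tokens("Abadanem abadańczykami abadance")
    # swap every inflected form for its lemma
    toks <- tokens_replace(toks, pattern = lemmas$form,
                           replacement = lemmas$lemma)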

Working with text classification and big sparse matrices in R

不羁的心 submitted on 2019-12-04 15:44:51
I'm working on a text multi-class classification project, and I need to build the document/term matrices and then train and test in R. I already have datasets that don't fit in the limited dimensionality of the base matrix class in R, and would need to build big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, as it has so far been more useful and reliable than the tm package, where creating a DocumentTermMatrix with a dictionary makes the process incredibly memory-hungry even with small datasets. Currently, as I said, I use quanteda to build …
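A minimal sketch of the kind of sparse pipeline the excerpt describes, assuming character vectors train_texts/test_texts and a label vector train_labels (all illustrative): the matrix never gets densified, and dfm_match() aligns the test features with the training features.

    library(quanteda)
    library(glmnet)  # handles sparse matrices natively
    dfm_train <- dfm(tokens(train_texts))
    dfm_test  <- dfm_match(dfm(tokens(test_texts)),
                           features = featnames(dfm_train))
    # coerce to plain sparse matrices and fit a multinomial classifier
    fit <- cv.glmnet(as(dfm_train, "dgCMatrix"), train_labels,
                     family = "multinomial")
    preds <- predict(fit, newx = as(dfm_test, "dgCMatrix"),
                     s = "lambda.min", type = "class")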

Can the ANEW dictionary be used for sentiment analysis in quanteda?

六眼飞鱼酱① submitted on 2019-12-03 22:07:19
I am trying to find a way to implement the Affective Norms for English Words (in Dutch) for a longitudinal sentiment analysis with quanteda. What I ultimately want is a "mean sentiment" per year, in order to show any longitudinal trends. In the dataset, all words are scored on a 7-point Likert scale by 64 coders on four categories, which provides a mean score for each word. What I want to do is take one of those dimensions and use it to analyse changes in emotion over time. I realise that quanteda has a function for implementing the LIWC dictionary, but I would prefer using the open-source …
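Since quanteda dictionaries hold word lists rather than continuous weights, one common workaround is to score documents by multiplying the dfm with a vector of per-word valences. A sketch, assuming a data frame anew with columns word and valence (both names illustrative, not the actual ANEW distribution):

    library(quanteda)
    d <- dfm(tokens(my_corpus))
    # keep only the words that have a valence score
    d <- dfm_select(d, pattern = anew$word)
    scores <- anew$valence[match(featnames(d), anew$word)]
    # valence-weighted mean per document
    doc_score <- as.vector(d %*% scores) / rowSums(d)

Averaging doc_score within a year docvar would then give the longitudinal series.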

Create dfm step by step with quanteda

末鹿安然 submitted on 2019-12-03 14:02:48
Question: I want to analyze a big (n = 500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one case, I don't want to tokenize before removing stopwords, as this would result in many useless bigrams; in another, I have to preprocess the text with language-specific procedures. I would like this sequence to be implemented: 1) remove the …
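In current quanteda, such a sequence can be spelled out with the tokens_*() family, one explicit step at a time; a sketch under the assumptions stated in the question (my_corpus is an illustrative corpus object):

    library(quanteda)
    toks <- tokens(my_corpus, remove_punct = TRUE)
    # remove stopwords before forming ngrams, so no useless bigrams appear
    toks <- tokens_remove(toks, stopwords("english"))
    toks <- tokens_ngrams(toks, n = 1:2)
    my_dfm <- dfm(toks)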

R: LIME returns an error about differing feature numbers when that's not the case

心不动则不痛 submitted on 2019-12-01 12:08:44
I'm building a text classifier of Clinton & Trump tweets (the data can be found on Kaggle). I'm doing EDA and modelling using the quanteda package:

    library(dplyr)
    library(stringr)
    library(quanteda)
    library(lime)
    library(readr)      # read_csv(), used below but not loaded in the original
    library(lubridate)  # as_date(), hms(), hour()

    # data prep
    tweet_csv <- read_csv("tweets.csv")
    tweet_data <- tweet_csv %>%
      select(author = handle, text, retweet_count, favorite_count,
             source_url, timestamp = time) %>%
      mutate(date = as_date(str_sub(timestamp, 1, 10)),
             hour = hour(hms(str_sub(timestamp, 12, 19))),
             tweet_num = row_number()) %>%
      select(-timestamp)

    # creating corpus and dfm
    tweet_corpus <- corpus(tweet_data)
    edited_dfm <- dfm …
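Without the rest of the question it is hard to be definitive, but this class of LIME error often means the dfm built inside the explainer's preprocess function at explain time has a different feature set than the one used for training; dfm_match() can pin the features down. A hedged sketch (get_matrix and edited_dfm are assumed from the code above):

    get_matrix <- function(texts) {
      # build a dfm for the perturbed texts, then force it onto the
      # training feature space so the model always sees the same columns
      d <- dfm(tokens(texts))
      dfm_match(d, features = featnames(edited_dfm))
    }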