quanteda

R: problems applying LIME to quanteda text model

ε祈祈猫儿з submitted on 2019-12-06 05:20:15
This is a modified version of my previous question: I'm trying to run LIME on my quanteda text model that feeds off Trump & Clinton tweets data. I run it following the example given by Thomas Pedersen in his Understanding LIME and the useful SO answer provided by @Weihuang Wong:

    library(dplyr)
    library(stringr)
    library(quanteda)
    library(lime)
    library(readr)  # provides read_csv(), used below but not loaded in the original

    # data prep
    tweet_csv <- read_csv("tweets.csv")

    # creating corpus and dfm for train and test sets
    get_matrix <- function(df){
      corpus <- quanteda::corpus(df)
      quanteda::dfm(corpus, remove_url = TRUE, remove_punct = TRUE,
                    remove = stopwords("english"))
    }
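A minimal sketch of how these pieces typically fit together in Pedersen's approach, assuming a trained classifier and train/test text vectors; trained_model, train_tweets, and test_tweets are illustrative names, not from the question:

    library(lime)
    # build an explainer from the raw training texts; `preprocess` tells lime
    # how to turn perturbed texts back into the dfm format the model expects
    explainer <- lime(train_tweets$text, model = trained_model,
                      preprocess = get_matrix)
    # explain a handful of test tweets
    explanation <- explain(test_tweets$text, explainer,
                           n_labels = 1, n_features = 5)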

Logical combinations in quanteda dictionaries

落爺英雄遲暮 submitted on 2019-12-06 00:11:59
I'm using quanteda's dictionary lookup. I'd like to formulate entries that let me look up logical combinations of words, for example: Teddybear = (fluffy AND adorable AND soft). Is this possible? So far I have only found a way to test for phrases like Teddybear = (soft fluffy adorable), but then it has to be an exact phrase match in the text. How can I get matches that ignore the order of the words? This is not currently something that is directly possible in quanteda (v1.2.0). However, there are workarounds in which you create dictionary sequences that are permutations of your desired …
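A sketch of that permutation workaround, assuming the goal is to match the three words in any order when they occur as an adjacent sequence (combinat::permn() generates the orderings; this still will not match the words when other tokens separate them):

    library(quanteda)
    library(combinat)
    # all six orderings of the three words, as space-separated phrases
    perms <- sapply(permn(c("fluffy", "adorable", "soft")),
                    paste, collapse = " ")
    dict <- dictionary(list(teddybear = perms))
    toks <- tokens("the bear was soft fluffy adorable and cuddly")
    tokens_lookup(toks, dictionary = dict)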

R Text Mining with quanteda

谁说胖子不能爱 submitted on 2019-12-04 22:41:32
I have a data set of Facebook posts (via Netvizz) and I use the quanteda package in R. Here is my R code:

    # Load the relevant dictionary (relevant for analysis)
    liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")

    # Read file: Facebook posts can be generated by FB Netvizz
    # https://apps.facebook.com/netvizz
    # Load FB posts as .csv file from .zip file
    fbpost <- read.csv("D:/FB-com.csv", sep = ";")

    # Define the relevant column(s): one column with 2700 entries
    # (the original read the data into fbpost but referenced FB_com here)
    fb_test <- as.character(fbpost$comment_message)

    # Define as corpus
    fb_corp <- corpus(fb_test)
    class(fb_corp)
    # LIWC …
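To continue where the excerpt breaks off, a plausible next step is counting LIWC categories over the corpus; a minimal sketch using quanteda's lookup function, assuming the objects defined above:

    # document-feature matrix from the corpus
    fb_dfm <- dfm(tokens(fb_corp))
    # map individual features onto their LIWC categories
    fb_liwc <- dfm_lookup(fb_dfm, dictionary = liwcdict)
    topfeatures(fb_liwc)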

R: sparse matrix multiplication with data.table and quanteda package?

跟風遠走 submitted on 2019-12-04 21:02:33
I am trying to do a matrix multiplication with a sparse matrix from the package called quanteda, utilising the data.table package, related to this thread here. So:

    require(quanteda)
    mytext <- c("Let the big dogs hunt", "No holds barred",
                "My child is an honor student")
    myMatrix <- dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
    # a data.table
    as.matrix(myMatrix) %*% transpose(as.matrix(myMatrix))

How can you get the matrix multiplication working here with the quanteda package and sparse matrices? This works just fine:

    mytext <- c("Let the big dogs hunt", "No holds barred", "My …
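For what it's worth, a dfm already is a sparse Matrix object, so the product can be computed without ever densifying; note also that data.table's transpose() works on lists, not matrices, which is why t() is used here. A sketch in newer quanteda syntax (ignoredFeatures/stem were the old dfm() arguments):

    library(quanteda)
    mytext <- c("Let the big dogs hunt", "No holds barred",
                "My child is an honor student")
    mydfm <- dfm_wordstem(dfm_remove(dfm(tokens(mytext)),
                                     stopwords("english")))
    # coerce to a plain sparse matrix and multiply; everything stays sparse
    m <- as(mydfm, "dgCMatrix")
    sim <- m %*% t(m)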

A lemmatizing function using a hash dictionary does not work with the tm package in R

孤街醉人 submitted on 2019-12-04 20:18:40
I would like to lemmatize Polish text using a large external dictionary (formatted like the txt variable below). I am not lucky enough to have a Polish option in the popular text mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.) The function works well with a vector of texts; unfortunately it does not work with the corpus format generated by tm. Let me paste Dmitriy's code:

    library(hashmap)
    library(data.table)
    txt = "Abadan Abadanem Abadan …
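One hedged guess at the missing piece: tm applies transformations through tm_map(), and a function that operates on character vectors usually needs wrapping in content_transformer() first. A sketch, assuming lemma_tokenizer() is Dmitriy's function from the linked answer and returns a vector of lemmatized tokens (polish_texts is an illustrative character vector):

    library(tm)
    corp <- VCorpus(VectorSource(polish_texts))
    # wrap the vector-level lemmatizer so tm can apply it document by document
    corp <- tm_map(corp, content_transformer(
      function(x) paste(lemma_tokenizer(x), collapse = " ")))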

Lemmatization using a txt file with lemmas in R

假装没事ソ submitted on 2019-12-04 19:47:56
I would like to use an external txt file with Polish lemmas, structured as lemma/inflected-form pairs like the following (source for lemmas for many other languages: http://www.lexiconista.com/datasets/lemmatization/):

    Abadan      Abadanem
    Abadan      Abadanie
    Abadan      Abadanowi
    Abadan      Abadanu
    abadańczyk  abadańczycy
    abadańczyk  abadańczyka
    abadańczyk  abadańczykach
    abadańczyk  abadańczykami
    abadańczyk  abadańczyki
    abadańczyk  abadańczykiem
    abadańczyk  abadańczykom
    abadańczyk  abadańczyków
    abadańczyk  abadańczykowi
    abadańczyk  abadańczyku
    abadanka    abadance
    abadanka    abadanek
    abadanka    abadanką
    abadanka    abadankach
    abadanka    abadankami

What packages and with …
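For the question that gets cut off, one way this is commonly done in quanteda is with tokens_replace(), reading the two-column file as lemma/form pairs. A sketch, assuming the file is tab-separated with the lemma in the first column (the file name is illustrative):

    library(quanteda)
    lemmas <- read.delim("lemmatization-pl.txt", header = FALSE,
                         col.names = c("lemma", "form"),
                         stringsAsFactors = FALSE, encoding = "UTF-8")
    toks <- tokens("Abadanem abadańczykami abadance")
    # swap every inflected form for its lemma
    toks <- tokens_replace(toks, pattern = lemmas$form,
                           replacement = lemmas$lemma)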

Working with text classification and big sparse matrices in R

不羁的心 submitted on 2019-12-04 15:44:51
I'm working on a text multi-class classification project, and I need to build the document/term matrices and then train and test in R. I already have datasets that don't fit in the limited dimensionality of the base matrix class in R, and would need to build big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, as it has so far been more useful and reliable than the tm package, where creating a DocumentTermMatrix with a dictionary makes the process incredibly memory-hungry even with small datasets. Currently, as I said, I use quanteda to build …
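A minimal sketch of the kind of sparse pipeline the excerpt describes, assuming character vectors train_texts/test_texts and a label vector train_labels (all illustrative): the matrix never gets densified, and dfm_match() aligns the test features with the training features.

    library(quanteda)
    library(glmnet)  # handles sparse matrices natively
    dfm_train <- dfm(tokens(train_texts))
    dfm_test  <- dfm_match(dfm(tokens(test_texts)),
                           features = featnames(dfm_train))
    # coerce to plain sparse matrices and fit a multinomial classifier
    fit <- cv.glmnet(as(dfm_train, "dgCMatrix"), train_labels,
                     family = "multinomial")
    preds <- predict(fit, newx = as(dfm_test, "dgCMatrix"),
                     s = "lambda.min", type = "class")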

Can the ANEW dictionary be used for sentiment analysis in quanteda?

六眼飞鱼酱① submitted on 2019-12-03 22:07:19
I am trying to find a way to implement the Affective Norms for English Words (in Dutch) for a longitudinal sentiment analysis with quanteda. What I ultimately want is a "mean sentiment" per year, in order to show any longitudinal trends. In the dataset, all words are scored on a 7-point Likert scale by 64 coders on four categories, which provides a mean score for each word. What I want to do is take one of those dimensions and use it to analyse changes in emotion over time. I realise that quanteda has a function for implementing the LIWC dictionary, but I would prefer using the open-source …
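Since quanteda dictionaries hold word lists rather than continuous weights, one common workaround is to score documents by multiplying the dfm with a vector of per-word valences. A sketch, assuming a data frame anew with columns word and valence (both names illustrative, not the actual ANEW distribution):

    library(quanteda)
    d <- dfm(tokens(my_corpus))
    # keep only the words that have a valence score
    d <- dfm_select(d, pattern = anew$word)
    scores <- anew$valence[match(featnames(d), anew$word)]
    # valence-weighted mean per document
    doc_score <- as.vector(d %*% scores) / rowSums(d)

Averaging doc_score within a year docvar would then give the longitudinal series.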

Create dfm step by step with quanteda

末鹿安然 submitted on 2019-12-03 14:02:48
Question: I want to analyze a big (n = 500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one case, I don't want to tokenize before removing stopwords, as this would result in many useless bigrams; in another, I have to preprocess the text with language-specific procedures. I would like this sequence to be implemented: 1) remove the …
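In current quanteda, such a sequence can be spelled out with the tokens_*() family, one explicit step at a time; a sketch under the assumptions stated in the question (my_corpus is an illustrative corpus object):

    library(quanteda)
    toks <- tokens(my_corpus, remove_punct = TRUE)
    # remove stopwords before forming ngrams, so no useless bigrams appear
    toks <- tokens_remove(toks, stopwords("english"))
    toks <- tokens_ngrams(toks, n = 1:2)
    my_dfm <- dfm(toks)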

R: LIME returns an error about differing feature numbers when that's not the case

心不动则不痛 submitted on 2019-12-01 12:08:44
I'm building a text classifier of Clinton & Trump tweets (the data can be found on Kaggle). I'm doing EDA and modelling using the quanteda package:

    library(dplyr)
    library(stringr)
    library(quanteda)
    library(lime)
    library(readr)      # read_csv(), used below but not loaded in the original
    library(lubridate)  # as_date(), hms(), hour()

    # data prep
    tweet_csv <- read_csv("tweets.csv")
    tweet_data <- tweet_csv %>%
      select(author = handle, text, retweet_count, favorite_count,
             source_url, timestamp = time) %>%
      mutate(date = as_date(str_sub(timestamp, 1, 10)),
             hour = hour(hms(str_sub(timestamp, 12, 19))),
             tweet_num = row_number()) %>%
      select(-timestamp)

    # creating corpus and dfm
    tweet_corpus <- corpus(tweet_data)
    edited_dfm <- dfm …
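Without the rest of the question it is hard to be definitive, but this class of LIME error often means the dfm built inside the explainer's preprocess function at explain time has a different feature set than the one used for training; dfm_match() can pin the features down. A hedged sketch (get_matrix and edited_dfm are assumed from the code above):

    get_matrix <- function(texts) {
      # build a dfm for the perturbed texts, then force it onto the
      # training feature space so the model always sees the same columns
      d <- dfm(tokens(texts))
      dfm_match(d, features = featnames(edited_dfm))
    }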