quanteda

Form bigrams without stopwords in R

末鹿安然 submitted on 2019-12-24 01:59:15

Question: I have been having some trouble with bigrams in text mining using R recently. The goal is to find meaningful keywords in news text, for example "smart car" and "data mining". Say I have the following string: "IBM have a great success in the computer industry for the past decades..." After removing the stopwords ("have", "a", "in", "the", "for"), it becomes "IBM great success computer industry past decades..." As a result, bigrams like "success computer" or "industry past" will occur. But what I really need is
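A common approach to this (a sketch of the standard quanteda idiom, not necessarily what the asker ended up using): remove stopwords with `padding = TRUE`, which leaves an empty placeholder at each removed position so that `tokens_ngrams()` never forms a bigram across a gap left by a deleted stopword.

```r
library(quanteda)

txt  <- "IBM have a great success in the computer industry for the past decades"
toks <- tokens(txt, remove_punct = TRUE)

# padding = TRUE keeps an empty placeholder where each stopword was,
# so ngrams are not formed across removed positions
toks <- tokens_remove(toks, stopwords("english"), padding = TRUE)

# only word pairs that were adjacent in the original text survive,
# so spurious pairs like "success computer" are not generated
bigrams <- tokens_ngrams(toks, n = 2)
```

With padding, "success" and "computer" are separated by the pads left by "in the", so no `success_computer` bigram is produced.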

QUANTEDA - invalid class “dfmSparse” object

冷暖自知 submitted on 2019-12-22 16:47:15

Question: I get this warning message. I am using this data: https://github.com/kbenoit/quanteda/tree/master/data/data_char_inaugural.RData RStudio version: 1.0.136 (© 2009-2016 RStudio, Inc.) library(quanteda) uk2010immigCorpus <- corpus(data_char_ukimmig2010, docvars = data.frame(party = names(data_char_ukimmig2010)), metacorpus = list(notes = "Immigration-related sections of 2010 UK party manifestos")) mydfm <- dfm(uk2010immigCorpus, remove = c("will", stopwords("english")), removePunct = TRUE)
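This "invalid class" error typically arises when a dfm (an S4 object built on top of the Matrix package) was created or saved under a different Matrix/quanteda version than the one currently loaded. A hedged sketch of the usual remedy: update both packages and rebuild the object from the shipped text rather than loading a stale .RData file.

```r
# update the packages whose S4 class definitions the dfm depends on
install.packages(c("Matrix", "quanteda"))

library(quanteda)

# rebuild the dfm from the character vector bundled with quanteda,
# instead of loading a dfm serialized under an older package version
corp  <- corpus(data_char_ukimmig2010)
mydfm <- dfm(tokens(corp, remove_punct = TRUE))
mydfm <- dfm_remove(mydfm, c("will", stopwords("english")))
```

Note that `removePunct` in the question is the older argument spelling; current quanteda versions use `remove_punct` at the `tokens()` step.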

R Text Mining with quanteda

情到浓时终转凉″ submitted on 2019-12-22 00:28:03

Question: I have a data set of Facebook posts (exported via Netvizz) and I use the quanteda package in R. Here is my R code. # Load the relevant dictionary (relevant for analysis) liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC") # Read File # Facebook posts can be generated by FB Netvizz # https://apps.facebook.com/netvizz # Load FB posts as .csv-file from .zip-file fbpost <- read.csv("D:/FB-com.csv", sep=";") # Define the relevant column(s) fb_test <- as.character(FB_com$comment
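One likely culprit visible in the excerpt: the file is read into `fbpost` but the column is taken from `FB_com`, a different (undefined) object. A sketch of the dictionary-counting pipeline with consistent naming (the file paths and the `comment` column name are the asker's; adjust to your own data):

```r
library(quanteda)

# the .dic path and .csv path below mirror the question; substitute your own
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
fbpost   <- read.csv("FB-com.csv", sep = ";", stringsAsFactors = FALSE)

# use the same object name that read.csv() created
fb_text <- as.character(fbpost$comment)

# count dictionary categories per comment
toks     <- tokens(fb_text, remove_punct = TRUE)
dict_dfm <- dfm(tokens_lookup(toks, dictionary = liwcdict))
```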

Quanteda: Fastest way to replace tokens with lemma from dictionary?

旧时模样 submitted on 2019-12-21 23:06:38

Question: Is there a much faster alternative to R quanteda::tokens_lookup()? I use tokens() in the quanteda R package to tokenize a data frame with 2,000 documents. Each document is 50-600 words. This takes a couple of seconds on my PC (Microsoft R Open 3.4.1, Intel MKL, using 2 cores). I have a dictionary object, made from a data frame of nearly 600,000 words (TERMS) and their corresponding lemmas (PARENT). There are 80,000 distinct lemmas. I use tokens_lookup() to replace the elements in the token
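For plain one-to-one lemma substitution, quanteda's `tokens_replace()` with exact matching is usually a much better fit than `tokens_lookup()` on a table this size, since it avoids building a dictionary object at all. A sketch, with a toy stand-in for the asker's 600,000-row TERMS/PARENT table:

```r
library(quanteda)

# toy stand-in for the large TERMS/PARENT lemma table from the question
lemma_df <- data.frame(TERMS  = c("cars", "ran", "running"),
                       PARENT = c("car", "run", "run"),
                       stringsAsFactors = FALSE)

toks <- tokens("the cars ran while running")

# valuetype = "fixed" does exact string matching, avoiding glob/regex
# pattern compilation overhead on very large replacement tables
toks_lemma <- tokens_replace(toks,
                             pattern     = lemma_df$TERMS,
                             replacement = lemma_df$PARENT,
                             valuetype   = "fixed")
```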

Working with text classification and big sparse matrices in R

不问归期 submitted on 2019-12-21 22:22:49

Question: I am working on a multi-class text classification project, and I need to build the document/term matrices and train and test in the R language. I already have datasets that do not fit within the limited dimensionality of R's base matrix class, and I need to build big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, as it has so far been more useful and reliable than the tm package, where creating a DocumentTermMatrix with a dictionary makes
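A quanteda dfm is already stored as a sparse matrix (it extends the Matrix package's sparse classes), so 100k short documents are well within reach. A sketch of building one at that scale and coercing it for a sparse-aware classifier; the coercion to `dgCMatrix` is my assumption about the downstream model's input format, not something from the question:

```r
library(quanteda)

# 100k short documents; a dfm stays sparse, so this fits in memory easily
texts <- rep(c("good service fast reply", "bad slow awful support"), 50000)
toks  <- tokens(texts)
mydfm <- dfm(toks)

# hand the sparse matrix to a classifier without densifying it;
# glmnet, for example, accepts dgCMatrix input directly
x <- as(mydfm, "dgCMatrix")
```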

R: removal of regex from Quanteda DFM, Sparse Document-Feature Matrix, object?

天涯浪子 submitted on 2019-12-21 17:40:06

Question: The quanteda package provides the sparse document-feature matrix DFM, and its methods include removeFeatures. I have tried dfm(x, removeFeatures="\\b[a-z]{1-3}\\b") to remove too-short words, as well as dfm(x, keptFeatures="\\b[a-z]{4-99}\\b") to preserve sufficiently long words, but it is not working; both calls do the same thing, i.e. fail to drop the too-short words. How can I remove a regex match from a quanteda DFM object? Example: myMatrix <- dfm(myData, ignoredFeatures = stopwords("english"), stem = TRUE,
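Two details are worth noting here: regex repetition uses a comma, `{1,3}`, not `{1-3}`, and in current quanteda this kind of filtering is done on an existing dfm with `dfm_remove()`/`dfm_select()` rather than via `dfm()` arguments. A sketch of both routes:

```r
library(quanteda)

mydfm <- dfm(tokens(c("an apple is ok", "go eat the pineapple")))

# drop features of 1-3 characters via a regex (note {1,3}, not {1-3})
short_gone <- dfm_remove(mydfm, pattern = "^[a-z]{1,3}$",
                         valuetype = "regex")

# equivalently, dfm_select() can filter directly on feature length
long_only <- dfm_select(mydfm, min_nchar = 4)
```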

Can the ANEW dictionary be used for sentiment analysis in quanteda?

醉酒当歌 submitted on 2019-12-12 08:16:24

Question: I am trying to find a way to use the Affective Norms for English Words (in Dutch) for a longitudinal sentiment analysis with quanteda. What I ultimately want is a "mean sentiment" per year, in order to show any longitudinal trends. In the data set, all words are scored on a 7-point Likert scale by 64 coders in four categories, which provides a mean score for each word. What I want to do is take one of the dimensions and use it to analyse changes in emotions over time. I realise that
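Since ANEW assigns a continuous score per word rather than a category, one workable approach is a count-weighted average of word scores per document. A sketch under stated assumptions: the `scores` vector below is invented for illustration, and the real ANEW valence means would be loaded from the asker's data set.

```r
library(quanteda)

# hypothetical ANEW-style valence means (7-point scale), one per word;
# replace with the real per-word means from the ANEW data set
scores <- c(happy = 6.5, sad = 1.8, war = 2.1, peace = 6.2)

docs  <- c(doc1990 = "happy peace peace", doc1991 = "sad war")
mydfm <- dfm(tokens(docs))

# keep only scored words, then take the count-weighted mean per document;
# averaging these by year would give the longitudinal trend
mydfm     <- dfm_select(mydfm, pattern = names(scores), valuetype = "fixed")
counts    <- as.matrix(mydfm)
mean_sent <- counts %*% scores[colnames(counts)] / rowSums(counts)
```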

Quanteda - Extracting identified dictionary words

回眸只為那壹抹淺笑 submitted on 2019-12-12 04:24:04

Question: I am trying to extract the identified dictionary words from a quanteda dfm, but have been unable to find a solution. Does someone have a solution for this? Sample input:

dict <- dictionary(list(season = c("spring", "summer", "fall", "winter")))
dfm <- dfm("summer is great", dictionary = dict)

Output:

> dfm
Document-feature matrix of: 1 document, 1 feature.
1 x 1 sparse Matrix of class "dfmSparse"
       features
docs    season
  text1      1

I now know that a seasonality dict word has been identified in the
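The dfm collapses matches into category counts, so the matched words themselves have to be recovered at the tokens stage. A sketch of two standard ways to do that:

```r
library(quanteda)

dict <- dictionary(list(season = c("spring", "summer", "fall", "winter")))
toks <- tokens("summer is great")

# keep only the tokens that the dictionary matched ("summer" here)
hits <- tokens_select(toks, pattern = dict, selection = "keep")

# or show each matched word in its surrounding context
kwic(toks, pattern = dict)
```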

Assigning weights to different features in R

空扰寡人 submitted on 2019-12-11 13:09:33

Question: Is it possible to assign weights to different features before building a DFM in R? Consider this example in R:

str <- "apple is better than banana"
mydfm <- dfm(str, ignoredFeatures = stopwords("english"), verbose = FALSE)

The DFM mydfm looks like:

        apple better banana
  text1     1      1      1

But I want to assign weights (apple: 5, banana: 3) beforehand, so that DFM mydfm looks like:

        apple better banana
  text1     5      1      3

Answer 1: I don't think so; however, you can easily do it afterwards: library(quanteda) str <-
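In line with that answer, the weighting can be applied after construction. A sketch using the modern API (`tokens_remove()` replaces the old `ignoredFeatures` argument; `dfm_weight()` with a named `weights` vector is available in newer quanteda versions):

```r
library(quanteda)

str   <- "apple is better than banana"
mydfm <- dfm(tokens_remove(tokens(str), stopwords("english")))

# newer quanteda versions accept a named vector of per-feature weights;
# features without a weight ("better") keep their original counts
weighted <- dfm_weight(mydfm, weights = c(apple = 5, banana = 3))
```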