I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the buil
The answer is that caret (which uses naive_bayes from the naivebayes package) assumes a Gaussian distribution, whereas quanteda::textmodel_nb() is based on a more text-appropriate multinomial distribution (with the option of a Bernoulli distribution as well).
The documentation for textmodel_nb() replicates the example from the IIR book (Manning, Raghavan, and Schütze 2008), and a further example from Jurafsky and Martin (2018) is also referenced. See:
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. An Introduction to Information Retrieval. Cambridge University Press (Chapter 13). https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Jurafsky, Daniel, and James H. Martin. 2018. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of 3rd edition, September 23, 2018 (Chapter 4). https://web.stanford.edu/~jurafsky/slp3/4.pdf
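To see concretely what the multinomial model computes, here is a minimal base-R sketch of the worked example from IIR Chapter 13 (the same example the textmodel_nb() documentation replicates): four short training documents about China, classes "yes"/"no", with add-one smoothing.

```r
# Worked example from Manning, Raghavan & Schütze (2008), ch. 13:
# four training documents, classes "yes" (about China) and "no".
train <- list(
  c("Chinese", "Beijing", "Chinese"),
  c("Chinese", "Chinese", "Shanghai"),
  c("Chinese", "Macao"),
  c("Tokyo", "Japan", "Chinese")
)
y <- c("yes", "yes", "yes", "no")
test_doc <- c("Chinese", "Chinese", "Chinese", "Tokyo", "Japan")

vocab <- unique(unlist(train))

# Multinomial NB with Laplace (add-one) smoothing:
# P(w | c) = (count of w in class c + 1) / (tokens in class c + |V|)
scores <- sapply(unique(y), function(cl) {
  tokens <- unlist(train[y == cl])
  counts <- table(factor(tokens, levels = vocab))
  p_w <- (counts + 1) / (length(tokens) + length(vocab))
  log(mean(y == cl)) + sum(log(p_w[test_doc]))
})

names(which.max(scores))
# "yes": the three occurrences of "Chinese" outweigh "Tokyo" and "Japan"
```

A Gaussian model, by contrast, would treat each term's count as a normally distributed numeric variable, which is a poor fit for sparse integer counts like these.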
Another package, e1071, produces the same results you found as it is also based on a Gaussian distribution.
library("e1071")
nb_e1071 <- naiveBayes(x = training_m,
                       y = as.factor(docvars(training_dfm, "Sentiment")))
nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
table(actual_class, nb_e1071_pred)
##             nb_e1071_pred
## actual_class neg pos
##          neg 246   3
##          pos 249   2
However, both caret and e1071 work on dense matrices, which is one reason they are so mind-numbingly slow compared to the quanteda approach, which operates on the sparse dfm. So from the standpoint of appropriateness, efficiency, and (as per your results) the performance of the classifier, it should be pretty clear which one is preferred!
library("rbenchmark")
benchmark(
    quanteda = {
        nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))
        predicted_class <- predict(nb_quanteda, newdata = test_dfm)
    },
    caret = {
        nb_caret <- train(x = training_m,
                          y = as.factor(docvars(training_dfm, "Sentiment")),
                          method = "naive_bayes",
                          trControl = trainControl(method = "none"),
                          tuneGrid = data.frame(laplace = 1,
                                                usekernel = FALSE,
                                                adjust = FALSE),
                          verbose = FALSE)
        predicted_class_caret <- predict(nb_caret, newdata = test_m)
    },
    e1071 = {
        nb_e1071 <- naiveBayes(x = training_m,
                               y = as.factor(docvars(training_dfm, "Sentiment")))
        nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
    },
    replications = 1
)
##       test replications elapsed relative user.self sys.self user.child sys.child
## 2    caret            1  29.042  123.583    25.896    3.095          0         0
## 3    e1071            1 217.177  924.157   215.587    1.169          0         0
## 1 quanteda            1   0.235    1.000     0.213    0.023          0         0
The above answer is correct; I just wanted to add that you can use a Bernoulli distribution with both the naivebayes and e1071 packages by turning your variables into factors. The output of these should then match quanteda's textmodel_nb() with a Bernoulli distribution.
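A minimal sketch of that conversion (the helper name and the toy matrix are illustrative; in the question this would be applied to training_m and test_m): each column of the document-term matrix is binarized to presence/absence and turned into a two-level factor, so the Gaussian packages treat every term as a Bernoulli variable.

```r
# Hypothetical helper: turn a document-term count matrix into a data frame
# of "0"/"1" factors, so that e1071/naivebayes model each term as a
# Bernoulli (presence/absence) variable rather than a Gaussian one.
as_bernoulli_factors <- function(m) {
  as.data.frame(lapply(as.data.frame(as.matrix(m)), function(col) {
    factor(as.integer(col > 0), levels = c(0, 1))
  }))
}

# Toy example: two documents, three terms.
m <- matrix(c(2, 0, 1,
              0, 3, 0), nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("good", "bad", "plot")))
train_df <- as_bernoulli_factors(m)
str(train_df)  # three columns, each a factor with levels "0" and "1"

# Then, for example, with e1071 (not run here):
# nb_bern <- e1071::naiveBayes(x = train_df, y = actual_labels)
# predict(nb_bern, newdata = as_bernoulli_factors(test_m))
```

Note that the same conversion must be applied to the test matrix, so that training and prediction see identical factor levels.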
Moreover, you could check out fastNaiveBayes: https://cran.r-project.org/web/packages/fastNaiveBayes/index.html. It implements Bernoulli, multinomial, and Gaussian distributions, works with sparse matrices, and is blazingly fast (currently the fastest on CRAN, according to its benchmarks).