Naive Bayes in Quanteda vs caret: wildly different results

Asked by 我寻月下人不归 on 2021-01-01 04:51 · 2 answers · 562 views

I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the buil

2 Answers
  • 2021-01-01 05:08

    The answer is that caret (which uses naive_bayes from the naivebayes package) assumes a Gaussian distribution, whereas quanteda::textmodel_nb() is based on a more text-appropriate multinomial distribution (with the option of a Bernoulli distribution as well).

    The documentation for textmodel_nb() replicates the example from the IIR book (Manning, Raghavan, and Schütze 2008) and a further example from Jurafsky and Martin (2018) is also referenced. See:

    • Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. An Introduction to Information Retrieval. Cambridge University Press (Chapter 13). https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

    • Jurafsky, Daniel, and James H. Martin. 2018. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of 3rd edition, September 23, 2018 (Chapter 4). https://web.stanford.edu/~jurafsky/slp3/4.pdf
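The Gaussian-versus-multinomial distinction can be demonstrated on a toy document-term matrix. The sketch below uses Python and scikit-learn purely as a stand-in (the distributional assumption, not the language, drives the difference); the counts, vocabulary, and labels are invented for illustration.

```python
# Toy demonstration (not from the answer): the same document-term counts
# fed to a multinomial vs a Gaussian naive Bayes model. Data is made up.
import numpy as np
from sklearn.naive_bayes import MultinomialNB, GaussianNB

# Rows = documents, columns = term counts (a tiny dfm); labels = sentiment.
X = np.array([
    [3, 0, 1, 0],   # e.g. "great great great plot"  -> pos
    [2, 1, 0, 0],   # e.g. "great great bad"         -> pos
    [0, 3, 0, 1],   # e.g. "bad bad bad acting"      -> neg
    [1, 2, 0, 1],   # e.g. "great bad bad acting"    -> neg
])
y = np.array(["pos", "pos", "neg", "neg"])

X_new = np.array([[2, 1, 1, 0]])  # unseen mixed-sentiment document

# Multinomial NB models P(term | class) from relative term frequencies,
# the text-appropriate assumption behind quanteda::textmodel_nb().
mnb = MultinomialNB(alpha=1.0).fit(X, y)

# Gaussian NB fits a normal density per feature per class, the assumption
# behind caret's naive_bayes and e1071::naiveBayes on numeric input; on
# sparse integer counts those densities are a poor fit.
gnb = GaussianNB().fit(X, y)

print("multinomial:", mnb.predict(X_new))
print("gaussian:   ", gnb.predict(X_new))
```

On real, sparse text data the Gaussian model's per-term normal densities degenerate (most counts are zero, so variances collapse), which is what produces the near-chance confusion matrix in the question.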

    Another package, e1071, produces the same results you found as it is also based on a Gaussian distribution.

    library("e1071")
    nb_e1071 <- naiveBayes(x = training_m,
                           y = as.factor(docvars(training_dfm, "Sentiment")))
    nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
    table(actual_class, nb_e1071_pred)
    ##             nb_e1071_pred
    ## actual_class neg pos
    ##          neg 246   3
    ##          pos 249   2
    

    However, both caret and e1071 operate on dense matrices, which is one reason they are so mind-numbingly slow compared to the quanteda approach, which operates on the sparse dfm. So from the standpoint of appropriateness, efficiency, and (as per your results) classifier performance, it should be pretty clear which one is preferred!

    library("rbenchmark")
    benchmark(
        quanteda = { 
            nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))
            predicted_class <- predict(nb_quanteda, newdata = test_dfm)
        },
        caret = {
            nb_caret <- train(x = training_m,
                              y = as.factor(docvars(training_dfm, "Sentiment")),
                              method = "naive_bayes",
                              trControl = trainControl(method = "none"),
                              tuneGrid = data.frame(laplace = 1,
                                                    usekernel = FALSE,
                                                    adjust = FALSE),
                              verbose = FALSE)
            predicted_class_caret <- predict(nb_caret, newdata = test_m)
        },
        e1071 = {
            nb_e1071 <- naiveBayes(x = training_m,
                           y = as.factor(docvars(training_dfm, "Sentiment")))
            nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
        },
        replications = 1
    )
    ##       test replications elapsed relative user.self sys.self user.child sys.child
    ## 2    caret            1  29.042  123.583    25.896    3.095          0         0
    ## 3    e1071            1 217.177  924.157   215.587    1.169          0         0
    ## 1 quanteda            1   0.235    1.000     0.213    0.023          0         0
    
  • 2021-01-01 05:29

    The above answer is correct; I just wanted to add that you can use a Bernoulli distribution with both the 'naivebayes' and 'e1071' packages by turning your variables into factors. The output of these should match quanteda's textmodel_nb with a Bernoulli distribution.
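The Bernoulli variant models term presence/absence rather than counts. A minimal sketch, again using scikit-learn as a stand-in with invented data: `BernoulliNB(binarize=0.0)` converts counts to 0/1 internally, which is the analogue of recoding the matrix columns as two-level factors for naivebayes/e1071.

```python
# Sketch (not from the answer): Bernoulli naive Bayes on binarized counts.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Same invented toy dfm as above; only presence/absence of each term is used.
X = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 3, 0, 1],
    [1, 2, 0, 1],
])
y = np.array(["pos", "pos", "neg", "neg"])

# binarize=0.0 thresholds counts to 0/1 before fitting, so the model
# estimates P(term present | class) instead of per-count frequencies.
bnb = BernoulliNB(alpha=1.0, binarize=0.0).fit(X, y)
pred = bnb.predict(np.array([[2, 1, 1, 0]]))
print(pred)
```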

    Moreover, you could check out the fastNaiveBayes package: https://cran.r-project.org/web/packages/fastNaiveBayes/index.html. It implements Bernoulli, multinomial, and Gaussian distributions, works with sparse matrices, and is blazingly fast (currently the fastest on CRAN).
