Topic models: cross validation with loglikelihood or perplexity

前端 未结 2 659
野性不改
野性不改 2020-12-07 10:15

I\'m clustering documents using topic modeling. I need to come up with the optimal topic numbers. So, I decided to do ten fold cross validation with topics 10, 20, ...60.

2条回答
  •  慢半拍i
    慢半拍i (楼主)
    2020-12-07 10:53

    I wrote the answer on CV that you refer to, here's a bit more detail:

    1. seq(2, 100, by =1) simply creates a number sequence from 2 to 100 by ones, so 2, 3, 4, 5, ... 100. Those are the numbers of topics that I want to use in the models. One model with 2 topics, another with 3 topics, another with 4 topics and so on to 100 topics.

    2. AssociatedPress[21:30] is simply a subset of the built-in data in the topicmodels package. I just used a subset in that example so that it would run faster.

    Regarding the general question of optimal topic numbers, I now follow the example of Martin Ponweiser on Model Selection by Harmonic Mean (4.3.3 in his thesis, which is here: http://epub.wu.ac.at/3558/1/main.pdf). Here's how I do it at the moment:

    library(topicmodels)
    #
    # get some of the example data that's bundled with the package
    #
    data("AssociatedPress", package = "topicmodels")
    
    harmonicMean <- function(logLikelihoods, precision=2000L) {
      library("Rmpfr")
      llMed <- median(logLikelihoods)
      as.double(llMed - log(mean(exp(-mpfr(logLikelihoods,
                                           prec = precision) + llMed))))
    }
    
    # The log-likelihood values are then determined by first fitting the model using for example
    k = 20
    burnin = 1000
    iter = 1000
    keep = 50
    
    fitted <- LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) )
    
    # where keep indicates that every keep iteration the log-likelihood is evaluated and stored. This returns all log-likelihood values including burnin, i.e., these need to be omitted before calculating the harmonic mean:
    
    logLiks <- fitted@logLiks[-c(1:(burnin/keep))]
    
    # assuming that burnin is a multiple of keep and
    
     harmonicMean(logLiks)
    

    So to do this over a sequence of topic models with different numbers of topics...

    # generate numerous topic models with different numbers of topics
    sequ <- seq(2, 50, 1) # in this case a sequence of numbers from 1 to 50, by ones.
    fitted_many <- lapply(sequ, function(k) LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) ))
    
    # extract logliks from each topic
    logLiks_many <- lapply(fitted_many, function(L)  L@logLiks[-c(1:(burnin/keep))])
    
    # compute harmonic means
    hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))
    
    # inspect
    plot(sequ, hm_many, type = "l")
    
    # compute optimum number of topics
    sequ[which.max(hm_many)]
    ## 6
    

    enter image description here Here's the output, with numbers of topics along the x-axis, indicating that 6 topics is optimum.

    Cross-validation of topic models is pretty well documented in the docs that come with the package, see here for example: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf Give that a try and then come back with a more specific question about coding CV with topic models.

提交回复
热议问题