IndexError when trying to update gensim's LdaModel

ⅰ亾dé卋堺 submitted on 2020-12-26 11:04:20

Question


I get the following error when trying to update my gensim LdaModel:

IndexError: index 6614 is out of bounds for axis 1 with size 6614

I checked why other people were having this issue in this thread. Their mistake was not using the same dictionary throughout, but I am using the same dictionary from beginning to end.

As I have a big dataset, I am loading it chunk by chunk (using pickle.load). I build the dictionary iteratively with this piece of code:

 

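import pickle
from time import time

from gensim.corpora import Dictionary

# First pass: build the dictionary over every pickled chunk
dictionary = Dictionary()
chunk_no = 0
with open("documents_lda_40_rails_30_ruby_full.dat", 'rb') as fr_documents_lda:
    while True:
        try:
            t0 = time()
            documents_lda = pickle.load(fr_documents_lda)
            chunk_no += 1
            dictionary.add_documents(documents_lda)
            t1 = time()
            print("Chunk number {0} took {1:.2f}s".format(chunk_no, t1-t0))
        except EOFError:
            # pickle.load raises EOFError once the file is exhausted
            print("Finished going through pickle")
            break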
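
Since the same read-until-EOFError loop is needed for both the dictionary pass and the training pass, it could be factored into a generator. This is only a sketch: the helper name iter_pickled_chunks and the throwaway demo file are assumptions for illustration, not part of the original code.

```python
import os
import pickle
import tempfile


def iter_pickled_chunks(path):
    """Yield each object that was pickled back-to-back into the file at `path`."""
    with open(path, "rb") as fh:
        while True:
            try:
                yield pickle.load(fh)
            except EOFError:
                # pickle.load raises EOFError once no more objects remain
                return


# Tiny demonstration with a throwaway file holding two toy chunks
demo_path = os.path.join(tempfile.mkdtemp(), "chunks.dat")
with open(demo_path, "wb") as fh:
    pickle.dump([["ruby", "rails"]], fh)
    pickle.dump([["python", "gensim"], ["lda"]], fh)

chunks = list(iter_pickled_chunks(demo_path))
print(len(chunks))
```

With such a helper, each pass becomes a plain for loop over iter_pickled_chunks(...), and the file is closed automatically when the loop ends.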

Once the dictionary is built for the whole dataset, I train the model iteratively in the same fashion:

fr_documents_lda = open("documents_lda_40_rails_30_ruby_full.dat", 'rb')
first_iter = True
chunk_no = 0
lda_gensim = None
while 1:
    try:
        t0 = time()
        documents_lda = pickle.load(fr_documents_lda) 
        chunk_no += 1
        corpus = [dictionary.doc2bow(text) for text in documents_lda]
        if first_iter:
            first_iter = False
            lda_gensim = LdaModel(corpus, num_topics=no_topics, iterations=100, offset=50., random_state=0, alpha='auto')
        else:
            lda_gensim.update(corpus)
        t1 = time()
        print("Chunk number {0} took {1:.2f}s".format(chunk_no, t1-t0))
    except EOFError:
        print("Finished going through pickle")
        break

I also tried updating the dictionary at every chunk, i.e. having  

dictionary.add_documents(documents_lda)

right before  

corpus = [dictionary.doc2bow(text) for text in documents_lda]

in the last piece of code. Finally, I tried setting the allow_update argument of doc2bow to True. Neither attempt fixed the error.

FYI, my final dictionary has 85k entries, while a dictionary built only from the first chunk has 10k. The error occurs on the second iteration, when execution enters the else branch and calls the update method.

The error is raised by the line expElogbetad = self.expElogbeta[:, ids], which runs inside gamma, sstats = self.inference(chunk, collect_sstats=True), itself called by gammat = self.do_estep(chunk, other), itself called by lda_gensim.update(corpus).

Does anyone have an idea how to fix this, or what is happening?

Thank you in advance.


Answer 1:


The solution is simply to initialize the LdaModel with the argument id2word = dictionary.

If you don't do that, the model assumes that the vocabulary size is that of the first set of documents it is trained on, and that size cannot grow later. In fact, it sets its num_terms value to the length of id2word once, at initialization, and never updates it afterwards (you can verify this in the update function).
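
The failure itself can be reproduced with plain NumPy. The snippet below is an illustration under stated assumptions, not gensim's code: exp_elog_beta is a stand-in for the model's topic-word matrix, and the sizes are toy values.

```python
import numpy as np

num_topics = 2
num_terms_first_chunk = 3  # vocabulary size inferred from the first chunk (toy value)

# Stand-in for the topic-word matrix, shaped (num_topics, num_terms)
exp_elog_beta = np.ones((num_topics, num_terms_first_chunk))

# A later document contains word id 3: valid for the full dictionary,
# but outside the 3 columns allocated from the first chunk
ids = [0, 3]

try:
    exp_elog_beta[:, ids]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

The message has the same form as the one in the question, where the first out-of-range id (6614) equals the number of columns allocated at initialization.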



Source: https://stackoverflow.com/questions/50214899/indexerror-when-trying-to-update-gensims-ldamodel
