gensim | 易学教程

IndexError when trying to update gensim's LdaModel

Question: I am getting the following error when trying to update my gensim LdaModel: IndexError: index 6614 is out of bounds for axis 1 with size 6614. I checked this thread to see why other people hit this issue; their mistake was changing the dictionary, but I use the same dictionary from beginning to end. As I have a big dataset, I am loading it chunk by chunk (using pickle.load). I build the dictionary iteratively with this piece of code: fr_documents_lda = open("documents
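The error message says that an array with 6614 columns (valid indices 0 to 6613) was asked for column 6614. The excerpt is cut off, so the real cause is not visible here. The sketch below only illustrates one way such a message can arise, assuming the model's matrix was sized before the dictionary finished growing. All names (topic_word, update_counts, fits_model) are hypothetical stand-ins in plain Python, not gensim API.

```python
# Hypothetical stand-in for a topic-word matrix: one row per topic,
# one column per term id that existed when the matrix was created.
num_topics = 2
num_terms_at_creation = 5
topic_word = [[0.0] * num_terms_at_creation for _ in range(num_topics)]


def update_counts(matrix, bow):
    """Add bag-of-words counts to every topic row (toy update step)."""
    for term_id, count in bow:
        for row in matrix:
            # Fails with IndexError when term_id >= number of columns.
            row[term_id] += count


def fits_model(bow, num_terms):
    """Return True if every term id in the document has a column."""
    return all(term_id < num_terms for term_id, _ in bow)


old_doc = [(0, 2), (4, 1)]   # ids known when the matrix was sized
new_doc = [(1, 1), (5, 3)]   # id 5 was added to the dictionary later

update_counts(topic_word, old_doc)                    # works
print(fits_model(new_doc, num_terms_at_creation))     # False: id 5 has no column
```

If this assumption matches the real situation, a check like fits_model on each chunk before updating would show which chunk introduces unseen term ids.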

gensim most_similar with positive and negative, how does it work?

Question: I was reading this answer, which says of gensim's most_similar: "it performs vector arithmetic: adding the positive vectors, subtracting the negative, then from that resulting position, listing the known-vectors closest to that angle." But when I tested it, that was not the case. I trained a Word2Vec model on the gensim "text8" dataset and tested these two: model.most_similar(positive=['woman', 'king'], negative=['man']) >>> [('queen', 0.7131118178367615), ('prince', 0.6359186768531799),...] model.wv
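The arithmetic described in the quoted answer can be sketched in plain Python. This is a toy illustration, not gensim's implementation: the vectors are made up, and the function names are hypothetical. It assumes that each input vector is scaled to unit length before being added or subtracted, which may be why a manual calculation on raw vectors gives different scores than most_similar.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word, weight in weighted:
        # Assumption: inputs are unit-normalized before being combined.
        for i, x in enumerate(unit(vectors[word])):
            target[i] += weight * x
    # The query words themselves are left out of the ranking.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 3-dimensional vectors, chosen only to make the analogy visible.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.0],
}

result = most_similar(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result[0][0])  # queen
```

Because cosine similarity ignores vector length, summing the weighted vectors and averaging them give the same ranking.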

Transforming a gensim.interfaces.TransformedCorpus to a readable result

Question: I am using Mallet LDA through gensim's wrapper. Now I want to get the topic distribution of several unseen documents, store it in a nested list, and then print it out. This is my code: other_texts = [ ['wlan', 'usb', 'router'], ['auto', 'auto', 'auto'], ['human', 'system', 'computer'] ] corpus1 = [id2word.doc2bow(text) for text in other_texts] to_pro = [] for t in corpus1: unseen_doc = corpus1 vector = lda[unseen_doc] # get topic probability distribution for a document to_pro

Doc2Vec Get most similar documents

Question: I am trying to build a document retrieval model that returns the most relevant documents, ordered by their relevance to a query or search string. For this I trained a doc2vec model using the Doc2Vec model in gensim. My dataset is a pandas dataset in which each document is stored as a string on its own line. This is the code I have so far: import gensim, re import pandas as pd # TOKENIZER def tokenizer(input_string): return re.findall(r"[\w']+", input_string) # IMPORT DATA data = pd
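The tokenizer in the excerpt is the only complete piece of the asker's code, so here is a small sketch of what it does on its own. The sample sentence is made up for illustration. The pattern keeps runs of word characters and apostrophes, so contractions stay in one piece while punctuation is dropped.

```python
import re


def tokenizer(input_string):
    # Same pattern as in the excerpt: word characters and apostrophes.
    return re.findall(r"[\w']+", input_string)


tokens = tokenizer("Don't panic: it's only a query!")
print(tokens)  # ["Don't", 'panic', "it's", 'only', 'a', 'query']
```

Note that the pattern does not lowercase anything, so "Query" and "query" would count as different tokens unless the input is lowercased first.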