gensim

Is there a Python library or tool that analyzes two bodies of text for similarities in order to provide recommendations?

别来无恙 submitted on 2021-02-08 06:38:36
Question: First, apologies for being long-winded. I'm not a mathematician, so I'm hoping there's a "dumbed down" solution to this. In short, I'm attempting to compare two bodies of text to generate recommendations. What you'll see below is a novice attempt at measuring similarity using NLP. I'm open to all feedback. But my primary question: does the method described below serve as an accurate means of finding similarities (in wording, sentiment, etc.) in two bodies of text? If not, how would you …
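A minimal sketch of the underlying idea (my own illustration, not the asker's code): represent each text as a term-count vector and score their overlap with cosine similarity. Libraries such as gensim or scikit-learn add TF-IDF weighting on top of this, but the core computation fits in a few lines of plain Python:

```python
import math
from collections import Counter

def cosine_sim(text_a, text_b):
    """Cosine similarity between two texts on raw term counts (0 = no shared words)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocab = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocab)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Because counts are non-negative, the result lands in [0, 1]; a real recommender would typically replace raw counts with TF-IDF weights and compare against many candidate texts at once.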

Efficient transformation of gensim TransformedCorpus data to array

懵懂的女人 submitted on 2021-02-07 10:13:30
Question: Is there a more direct or efficient method for getting the topic-probability data from a gensim.interfaces.TransformedCorpus object into a numpy array (or, alternatively, a pandas dataframe) than the by-row method below?

from gensim import models
import numpy as np

num_topics = 5
model = models.LdaMulticore(corpus, num_topics=num_topics, minimum_probability=0.0)
all_topics = model.get_document_topics(corpus)
num_docs = len(all_topics)
lda_scores = np.empty([num_docs, num_topics])
for i in …
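For reference, gensim ships a helper for exactly this conversion, gensim.matutils.corpus2dense (it returns a topics-by-documents matrix, so a transpose is usually wanted). The sketch below reimplements the core idea in plain numpy so the transformation is explicit; sparse_rows stands in for the rows of (topic_id, probability) pairs that get_document_topics yields:

```python
import numpy as np

def topics_to_dense(sparse_rows, num_topics):
    """Scatter rows of (topic_id, probability) pairs into a dense docs-by-topics array."""
    dense = np.zeros((len(sparse_rows), num_topics))
    for i, row in enumerate(sparse_rows):
        for topic_id, prob in row:
            dense[i, topic_id] = prob
    return dense
```

With minimum_probability=0.0 every topic appears in every row, but the scatter approach also works when low-probability topics are dropped, since missing entries simply stay zero.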

python IndexError using gensim for LDA Topic Modeling

ε祈祈猫儿з submitted on 2021-02-07 09:28:34
Question: Another thread has a similar question to mine but leaves out reproducible code. The goal of the script in question is to create a process that is as memory-efficient as possible, so I tried to write the class corpus() to take advantage of gensim's capabilities. However, I am running into an IndexError that I'm not sure how to resolve when creating

lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics))

The documents that I am …
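The memory-efficient pattern the question is aiming for is a corpus class that streams documents on demand instead of holding every bag-of-words vector in RAM. A hypothetical sketch (class and variable names are mine, not the asker's) of such a streaming corpus, yielding gensim-style (token_id, count) rows:

```python
from collections import Counter

class StreamingCorpus:
    """Stream gensim-style bag-of-words rows one document at a time."""

    def __init__(self, docs, vocab):
        self.docs = docs      # any re-iterable of token lists (could read files lazily)
        self.vocab = vocab    # token -> integer id mapping

    def __iter__(self):
        for tokens in self.docs:
            # Drop out-of-vocabulary tokens so every yielded id is valid;
            # an id outside the dictionary's range is a classic source of IndexError.
            counts = Counter(t for t in tokens if t in self.vocab)
            yield sorted((self.vocab[t], c) for t, c in counts.items())
```

A key detail is that the class must be re-iterable (a fresh generator per __iter__ call), since gensim models pass over the corpus multiple times.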

Cosine similarity between 0 and 1

末鹿安然 submitted on 2021-02-06 11:52:33
Question: I am interested in calculating similarity between vectors; however, this similarity has to be a number between 0 and 1. There are many questions concerning tf-idf and cosine similarity, all indicating that the value lies between 0 and 1. From Wikipedia: "In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (using tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°."
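The Wikipedia point is the key: with tf-idf weights all vector components are non-negative, so cosine similarity is already in [0, 1]. For vectors that can have negative components (word2vec or doc2vec embeddings, for example), cosine lies in [-1, 1], and a common remedy is the linear rescale (cos + 1) / 2. A small self-contained illustration:

```python
import math

def cosine(u, v):
    """Plain cosine similarity; in [-1, 1] for arbitrary real vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def cosine_01(u, v):
    """Linearly map cosine from [-1, 1] onto [0, 1]."""
    return (cosine(u, v) + 1) / 2
```

Note the rescale changes the interpretation: orthogonal vectors score 0.5 rather than 0, which may or may not suit a downstream recommender.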

How to abstract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in python- gensim?

荒凉一梦 submitted on 2021-02-06 09:26:09
Question: LDA original output (unigrams):

topic1: scuba, water, vapor, diving
topic2: dioxide, plants, green, carbon

Required output (bigram topics):

topic1: scuba diving, water vapor
topic2: green plants, carbon dioxide

Any idea?

Answer 1: Given I have a dict called docs, containing lists of words from documents, I can turn it into an array of words + bigrams (or also trigrams etc.) using nltk.util.ngrams or your own function, like this:

from nltk.util import ngrams

for doc in docs:
    docs[doc] = docs[doc] + ["_".join(w …
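The trick in the truncated answer is to join each bigram with an underscore so LDA treats it as a single vocabulary term. A self-contained sketch of that pattern, with a tiny stand-in for nltk.util.ngrams so no extra dependency is needed (a data-driven alternative is gensim.models.Phrases, which only merges statistically frequent word pairs rather than all of them):

```python
def ngrams(tokens, n):
    """Successive n-token windows, mimicking nltk.util.ngrams."""
    return zip(*(tokens[i:] for i in range(n)))

def add_bigrams(tokens):
    """Append underscore-joined bigrams so LDA sees each one as a single term."""
    return tokens + ["_".join(pair) for pair in ngrams(tokens, 2)]
```

After running every document through add_bigrams, the dictionary and corpus are built exactly as before, and topics can surface terms like "scuba_diving" alongside (or, if the unigrams are filtered out, instead of) plain words.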

Hierarchical Dirichlet Process Gensim topic number independent of corpus size

我的梦境 submitted on 2021-02-06 02:35:47
Question: I am using the Gensim HDP module on a set of documents.

>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17

Why is the number of topics independent of corpus length?

Answer 1: @user3907335 is exactly correct here: HDP will calculate as many topics as the …
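The answer's point is that HDP always reports up to its fixed truncation level (the HdpModel default is T=150 topics), not a count inferred from the data; many of those topics carry negligible weight. A common follow-up is to average each topic's weight across documents and keep only the topics that matter. A hypothetical helper (names and threshold are mine) sketching that filtering step:

```python
def significant_topics(topic_weights, threshold=1e-3):
    """Return ids of topics whose mean weight across documents exceeds threshold.

    topic_weights: 2-D structure, rows = documents, columns = topic probabilities.
    """
    num_docs = len(topic_weights)
    num_topics = len(topic_weights[0])
    avg = [sum(row[t] for row in topic_weights) / num_docs for t in range(num_topics)]
    return [t for t, w in enumerate(avg) if w > threshold]
```

Applied to the per-document topic distributions from either corpus, this typically leaves far fewer than 150 topics, and the surviving count does vary with the data.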
