gensim

Spark and Python trying to parse wikipedia using gensim

Submitted by 末鹿安然 on 2019-12-14 02:53:29
Question: Based on my previous question, Spark and Python use custom file format/generator as input for RDD, I think I should be able to parse basically any input with sc.textFile() and then with custom functions, either my own or from some library. Now I am trying in particular to parse the Wikipedia dump using the gensim framework. I have already installed gensim on my master node and on all my worker nodes, and now I would like to use gensim's built-in functions for parsing Wikipedia pages, inspired by this question List

Doc2vec : TaggedLineDocument()

Submitted by て烟熏妆下的殇ゞ on 2019-12-14 02:09:10
Question: So, I'm trying to learn and understand Doc2Vec. I'm following this tutorial. My input is a list of documents, i.e. a list of lists of words. This is what my code looks like:

    input = [["word1","word2",..."wordn"], ["word1","word2",..."wordn"], ...]
    documents = TaggedLineDocument(input)
    model = doc2vec.Doc2Vec(documents, size=50, window=10, min_count=2, workers=2)

But I am getting an error (tried googling it, but no luck):

    TypeError('don\'t know how to handle uri %s' % repr(uri)

How are word vectors co-trained with paragraph vectors in doc2vec DBOW?

Submitted by 。_饼干妹妹 on 2019-12-13 19:29:02
Question: I don't understand how word vectors are involved at all in the training process with gensim's doc2vec in DBOW mode (dm=0). I know that word training is disabled by default (dbow_words=0), but what happens when we set dbow_words to 1? In my understanding of DBOW, the context words are predicted directly from the paragraph vectors, so the only parameters of the model are the N p-dimensional paragraph vectors plus the parameters of the classifier. But multiple sources hint that it is possible in DBOW

Gensim example, TypeError: between str and int error

Submitted by 时间秒杀一切 on 2019-12-13 09:37:54
Question: When running the code below (Python 3.6, latest gensim, in Jupyter):

    for model in models:
        print(str(model))
        pprint(model.docvecs.most_similar(positive=["Machine learning"], topn=20))

[1]: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb

Answer 1:

    string = "machine learning".split()
    doc_vector = model.infer_vector(string)
    out = model.docvecs.most_similar([doc_vector])

I'm not 100% sure, since I'm using a more recent release, but I think that the

Term weighting for original LDA in gensim

Submitted by 六眼飞鱼酱① on 2019-12-13 05:02:12
Question: I am using the gensim library to apply LDA to a set of documents. Using gensim I can apply LDA to a corpus whatever the term weights are: binary, tf, tf-idf... My question is: what term weighting should be used for the original LDA? If I have understood correctly, the weights should be term frequencies, but I am not sure. Answer 1: It should be a corpus represented as a "bag of words", i.e. lists of term counts. The correct format is that of the corpus defined in the first tutorial

Gensim Word2Vec changing the input sentence order?

Submitted by 我怕爱的太早我们不能终老 on 2019-12-13 01:13:41
Question: In gensim's documentation, window size is defined as follows: "window is the maximum distance between the current and predicted word within a sentence." That should mean that, when looking at context, it doesn't go beyond the sentence boundary. Right? What I did was create a document with several thousand tweets, select a word (q1), and then select the words most similar to q1 (using model.most_similar('q1')). But then, if I randomly shuffle the tweets in the input document and then did the same

Extract array (column name, data) from Pandas DataFrame

Submitted by 柔情痞子 on 2019-12-13 00:39:18
Question: This is my first question on Stack Overflow. I have a Pandas DataFrame like this:

           a  b  c  d
    one    0  1  2  3
    two    4  5  6  7
    three  8  9  0  1
    four   2  1  1  5
    five   1  1  8  9

I want to extract, for each row, the (column name, value) pairs whose value is 1, with each row as a separate array:

    [ [(b,1.0)], [(d,1.0)], [(b,1.0),(c,1.0)], [(a,1.0),(b,1.0)] ]

I want to use the gensim Python library, which requires a corpus in this form. Is there any smart way to do this, or to apply gensim to Pandas data? Answer 1: Many gensim functions
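One way to build that structure is a row-wise comprehension; a sketch (to match the expected output in the question, rows without any 1 are dropped):

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 1], [2, 1, 1, 5], [1, 1, 8, 9]],
    index=["one", "two", "three", "four", "five"],
    columns=list("abcd"),
)

# For every row, keep the (column, value) pairs where the value equals 1.
pairs = [
    [(col, float(val)) for col, val in row.items() if val == 1]
    for _, row in df.iterrows()
]
# Drop rows containing no 1s, matching the expected output in the question.
pairs = [p for p in pairs if p]
```

If gensim should instead see one entry per document including empty ones, skip the final filtering step.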

Error loading Pretrained vectors on gensim 0.12

Submitted by 妖精的绣舞 on 2019-12-12 20:56:30
Question: I am calling load like this:

    model = gensim.models.Word2Vec.load("F:\\TrialGrounds\\gensimMODEL4\\model4")

and getting:

      File ".../dist-packages/gensim/utils.py", line 912, in ...
        model = super(Word2Vec, cls).load(*args, **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 248, in load
        obj = unpickle(fname)
      File "...", in unpickle
        return _pickle.loads(f.read())
    AttributeError: 'module' object has no attribute 'call_on_class_only'

The model is split across two 500 MB numpy arrays. Can

Create a dictionary with 'word groups'

Submitted by 混江龙づ霸主 on 2019-12-12 18:29:10
Question: I would like to do some text analysis on job descriptions and was going to use nltk. I can build a dictionary and remove the stopwords, which is part of what I want. However, in addition to the single words and their frequencies, I would like to keep meaningful 'word groups' and count them as well. For example, in job descriptions containing 'machine learning' I don't want to consider 'machine' and 'learning' separately, but to retain the word group in my dictionary if it frequently occurs

How to perform efficient queries with Gensim doc2vec?

Submitted by て烟熏妆下的殇ゞ on 2019-12-12 16:33:15
Question: I'm working on a sentence similarity algorithm with the following use case: given a new sentence, I want to retrieve its n most similar sentences from a given set. I am using gensim v3.7.1, and I have trained both word2vec and doc2vec models. The results of the latter outperform word2vec's, but I'm having trouble performing efficient queries with my Doc2Vec model. This model uses the distributed bag of words implementation (dm=0). I used to infer similarity using the built-in method model
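One common pattern for making repeated queries cheap (a sketch under assumptions, not the asker's exact setup): export the trained document vectors to a numpy matrix, L2-normalize them once, and then each query is a single matrix-vector product. Random data stands in for the trained vectors here.

```python
import numpy as np

# Stand-in for the trained Doc2Vec document vectors (e.g. exported from
# model.docvecs); 1000 documents, 50 dimensions, random for illustration.
rng = np.random.default_rng(0)
doc_matrix = rng.normal(size=(1000, 50)).astype(np.float32)

# Normalize the rows once so cosine similarity reduces to a dot product.
doc_norm = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)

def top_n(query_vec, n=10):
    """Indices of the n documents most cosine-similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = doc_norm @ q                 # one matrix-vector product
    return np.argsort(-sims)[:n]

best = top_n(doc_matrix[3], n=5)        # the vector's own row ranks first
```

In practice the query vector would come from infer_vector on the new sentence's tokens, and the matrix from the model's stored document vectors.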