gensim

How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?

假如想象 submitted on 2019-12-10 15:51:04
Question: I'm trying to get the text with its punctuation, as punctuation is important to consider in my doc2vec model. However, WikiCorpus retrieves only the text. After searching the web I found these pages: A page from the gensim GitHub issues section. It was a question where the answer was to subclass WikiCorpus (answered by Piskvorky). Luckily, the same page included code implementing the suggested 'subclass' solution. The code was provided by Rhazegh. (link) Page from

What does epochs mean in Doc2Vec and train when I have to manually run the iteration?

安稳与你 submitted on 2019-12-10 15:49:34
Question: I am trying to understand the epochs parameter of the Doc2Vec constructor and the epochs parameter of the train function. In the following code snippet, I manually set up a loop of 4000 iterations. Is that required, or is passing 4000 as the epochs parameter to Doc2Vec enough? Also, how is epochs in Doc2Vec different from epochs in train? documents = Documents(train_set) model = Doc2Vec(vector_size=100, dbow_words=1, dm=0, epochs=4000, window=5, seed=1337, min_count=5, workers=4, alpha=0.001, min_alpha=0

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

佐手、 submitted on 2019-12-10 14:53:51
Question: I am working on node2vec. With a small dataset the code works well, but as soon as I run the same code on a large dataset, it crashes. Error: Process finished with exit code 134 (interrupted by signal 6: SIGABRT). The line that raises the error is model = Word2Vec(walks, size=args.dimensions, window=args.window_size, min_count=0, sg=1, workers=args.workers, iter=args.iter) I am using PyCharm and Python 3.5. Any idea what is happening? I could not find any post which could
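
The exit code quoted in the question can be decoded mechanically. On POSIX systems, a process terminated by a signal is conventionally reported with exit code 128 plus the signal number. The following is a minimal sketch under that assumption; `describe_exit_code` is a hypothetical helper name, not part of gensim or PyCharm.

```python
import signal


def describe_exit_code(code):
    """Return the signal name for a shell-style exit code above 128, else None."""
    if code <= 128:
        return None
    try:
        # Assumption: POSIX convention, exit code = 128 + signal number.
        return signal.Signals(code - 128).name
    except ValueError:
        # The number does not correspond to a signal known on this platform.
        return None


print(describe_exit_code(134))
```

Knowing which signal ended the process only narrows the search: it says how the process died, not why.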

Should I use the TF-IDF corpus or just the corpus to infer documents using LDA?

删除回忆录丶 submitted on 2019-12-10 14:33:22
Question: I am wondering whether the TF-IDF corpus or the plain corpus should be used when inferring documents with LDA in gensim. Here is an example: from gensim import corpora, models import numpy.random numpy.random.seed(10) doc0 = [(0, 1), (1, 1)] doc1 = [(0,1)] doc2 = [(0, 1), (1, 1)] doc3 = [(0, 3), (1, 1)] corpus = [doc0,doc1,doc2,doc3] dictionary = corpora.Dictionary(corpus) tfidf = models.TfidfModel(corpus) corpus_tfidf = tfidf[corpus] corpus_tfidf.save('x.corpus_tfidf')

Cosine Similarity and LDA topics

余生长醉 submitted on 2019-12-10 11:06:39
Question: I want to compute the cosine similarity between LDA topics. The gensim function matutils.cossim can do it, but I don't know which parameter (vector) to pass to it. Here is a snippet of the code: import numpy as np import lda from sklearn.feature_extraction.text import CountVectorizer cvectorizer = CountVectorizer(min_df=4, max_features=10000, stop_words='english') cvz = cvectorizer.fit_transform(tweet_texts_processed) n_topics = 8 n_iter = 500 lda_model = lda.LDA(n_topics=n_topics, n
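
To make the expected input shape concrete: `matutils.cossim` compares two sparse vectors, each given as a list of `(id, weight)` pairs. The sketch below is a pure-Python stand-in for that computation, not gensim's own implementation; `sparse_cosine` and the two topic vectors are made up for illustration.

```python
import math


def sparse_cosine(vec1, vec2):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    d1, d2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        # An empty or all-zero vector has no direction; report zero similarity.
        return 0.0
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(w * d2[i] for i, w in d1.items() if i in d2)
    return dot / (norm1 * norm2)


# Hypothetical topic vectors: (word id, weight of that word in the topic).
topic_a = [(0, 0.5), (1, 0.5)]
topic_b = [(0, 0.5), (2, 0.5)]
print(sparse_cosine(topic_a, topic_b))
```

A dense topic-word row, such as one produced by another LDA library, would first need converting to this pair format, for example with `list(enumerate(row))`.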

How to use pretrained Word2Vec model in Tensorflow

让人想犯罪 __ submitted on 2019-12-10 04:01:52
Question: I have a Word2Vec model trained in Gensim. How can I use it in TensorFlow for word embeddings? I don't want to train embeddings from scratch in TensorFlow. Can someone show me how to do it with some example code? Answer 1: Let's assume you have a dictionary and an inverse_dict list, with the index in the list corresponding to the most common words: vocab = {'hello': 0, 'world': 2, 'neural':1, 'networks':3} inv_dict = ['hello', 'neural', 'world', 'networks'] Notice how the inverse_dict index
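
The index bookkeeping in the answer can be sketched without either library. The point is that row i of the embedding matrix must hold the vector for `inv_dict[i]`. In this sketch `word_vectors` is a hypothetical stand-in for the trained model's word-to-vector lookup, the two-dimensional vectors are invented, and the step that hands the matrix to TensorFlow is not shown.

```python
vocab = {'hello': 0, 'world': 2, 'neural': 1, 'networks': 3}
inv_dict = ['hello', 'neural', 'world', 'networks']

# Hypothetical stand-in for the trained model's word -> vector lookup.
word_vectors = {
    'hello': [0.1, 0.2],
    'neural': [0.3, 0.4],
    'world': [0.5, 0.6],
    'networks': [0.7, 0.8],
}


def build_embedding_matrix(inv_dict, word_vectors):
    """Return a list of rows where row i holds the vector for inv_dict[i]."""
    return [list(word_vectors[word]) for word in inv_dict]


embedding_matrix = build_embedding_matrix(inv_dict, word_vectors)

# Looking up a row by vocabulary index recovers that word's vector,
# which is what an embedding layer does with integer token ids.
print(embedding_matrix[vocab['world']])
```

Keeping `vocab` and `inv_dict` consistent with each other matters more than the particular order chosen, since token ids are only meaningful relative to the matrix rows.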

Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

房东的猫 submitted on 2019-12-10 01:59:21
Question: I am trying to do the following Kaggle assignment. I am using the gensim package for word2vec. I am able to create the model and store it to disk, but when I try to load the file back I get the error below. -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py Traceback (most recent call last): File "prog_w2v.py", line 7, in <module> models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True) File "/usr/local/lib/python2.7/dist-packages/gensim
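
One possible cause, offered as an assumption since the snippet does not show how the file was written: the file is a Python pickle rather than a word2vec-format file. A pickle written with protocol 2 or higher begins with the byte 0x80, which is not a valid first byte in UTF-8. The sketch below demonstrates this with the standard library only; the pickled dictionary is a made-up placeholder, not a real model.

```python
import pickle

# Placeholder object standing in for a saved model (assumption for illustration).
payload = pickle.dumps({"vectors": [0.1, 0.2]}, protocol=2)

# Protocol 2 and later start with the PROTO opcode, byte 0x80.
first_byte = payload[0]
print(hex(first_byte))

try:
    payload.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as err:
    # A reader that expects a text header fails on the very first byte.
    decoded_ok = False
    print(err)
```

If that is what happened, the file would need to be read with the loader that matches how it was saved, not with the word2vec-format loader.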

Why does Gensim doc2vec give AttributeError: 'list' object has no attribute 'words'?

China☆狼群 submitted on 2019-12-09 09:34:14
Question: I am experimenting with gensim doc2vec using the following code. As far as I understand from the tutorials, it should work. However, it gives AttributeError: 'list' object has no attribute 'words'. from gensim.models.doc2vec import LabeledSentence, Doc2Vec document = LabeledSentence(words=['some', 'words', 'here'], tags=['SENT_1']) model = Doc2Vec(document, size = 100, window = 300, min_count = 10, workers=4) So what did I do wrong? Any help, please. Thank you. I am using Python 3.5 and gensim 0
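
A likely explanation, sketched here under the assumption that the document type behaves like a namedtuple with `words` and `tags` fields: passing one document where an iterable of documents is expected makes the consumer iterate over the document's fields, which are plain lists. `Document` and `collect_words` below are hypothetical stand-ins, not imported from gensim.

```python
from collections import namedtuple

# Stand-in for the library's document type (assumption: a namedtuple-like record).
Document = namedtuple("Document", ["words", "tags"])

document = Document(words=["some", "words", "here"], tags=["SENT_1"])


def collect_words(documents):
    """Mimic a consumer that reads .words from every item of an iterable."""
    return [doc.words for doc in documents]


try:
    # Iterating a single document yields its fields, which are lists.
    collect_words(document)
    failed = False
except AttributeError as err:
    failed = True
    print(err)

# Wrapping the document in a list gives the consumer what it expects.
print(collect_words([document]))
```

Under that assumption, the fix in the question's code would be to pass `[document]`, or any iterable of documents, as the first argument.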

Gensim: how to retrain doc2vec model using previous word2vec model

旧城冷巷雨未停 submitted on 2019-12-08 13:54:35
With Doc2Vec modelling, I have trained a model and saved the following files: 1. model 2. model.docvecs.doctag_syn0.npy 3. model.syn0.npy 4. model.syn1.npy 5. model.syn1neg.npy However, I have a new way to label the documents and want to train the model again. Since the word vectors were already obtained from the previous version, is there any way to reuse that model (e.g., taking the previous w2v results as initial vectors for training)? Does anyone know how to do it? I've figured out that we can just load the model and continue to train. model = Doc2Vec.load("old_model") model.train(sentences) Source: https:/

Gensim Word2Vec Model trained but not saved

流过昼夜 submitted on 2019-12-08 10:24:09
Question: I am using gensim and executed the following code (simplified): model = gensim.models.Word2Vec(...) model.build_vocab(sentences) model.train(...) model.save('file_name') After days, my code finished model.train(...). However, during saving, I got: Process finished with exit code 137 (interrupted by signal 9: SIGKILL) I noticed that some npy files were generated: <...>.trainables.syn1neg.npy <...>.trainables.vectors_lockf.npy <...>.wv.vectors.npy Are those intermediate results I