gensim

GridSearch for doc2vec model built using gensim

时间秒杀一切 posted on 2019-12-11 08:44:02
Question: I am trying to find the best hyperparameters for my gensim doc2vec model, which takes a document as input and creates its document embedding. My training data consists of text documents with no labels, i.e. I have only 'X' and no 'y'. I found some related questions here, but all of the proposed solutions are for supervised models and none are for unsupervised ones like mine. Here is the code where I train my doc2vec model: def train_doc2vec( self, X:
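Without labels, a grid search needs some label-free score to rank candidate settings. The sketch below shows only the search loop in plain Python. `grid_search`, `param_grid`, and `score_fn` are hypothetical names, and the scoring function is an assumption left to the caller (one possibility is how often a training document's re-inferred vector ranks that same document first).

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return (best_params, best_score).

    param_grid: dict mapping a parameter name to a list of candidate values.
    score_fn:   caller-supplied callable taking a params dict and returning
                a number where higher is better (hypothetical; for doc2vec it
                would train a model with those params and score it).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in score so the loop can run without training anything.
def toy_score(params):
    return -abs(params["vector_size"] - 100) - abs(params["window"] - 5)


best, score = grid_search(
    {"vector_size": [50, 100, 200], "window": [2, 5, 10]}, toy_score
)
print(best, score)
```

In real use, `toy_score` would be replaced by a function that trains a model with the given parameters and returns the chosen label-free metric.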

Eclipse + PyDev ImportError

Deadly posted on 2019-12-11 07:18:31
Question: I am having trouble getting PyDev on Eclipse to recognize installed modules (gensim), which work fine in IDLE. I am using 32-bit Windows Vista and Python 2.7. I have found this question asked here, here, here, and here. The recommended solution is to go to Preferences > PyDev > Interpreter - Python and remove and re-add (with Auto Config) the Python interpreter. I have done this and restarted Eclipse. In PYTHONPATH, C:\Python27\lib\site-packages\gensim-0.8.0-py2.7.egg appears, but I
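One way to narrow down this kind of mismatch is to run the same diagnostic in both IDLE and PyDev and compare the results. This is a minimal sketch, assuming a Python 3 interpreter (`importlib.util.find_spec` is not available on the Python 2.7 mentioned in the question); `describe_import` is a hypothetical helper name.

```python
import importlib.util
import sys


def describe_import(module_name):
    """Report which interpreter is running and where a module would load from."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,      # interpreter actually in use
        "found": spec is not None,         # is the module importable here?
        "origin": spec.origin if spec else None,
        "sys_path": list(sys.path),        # search path seen by this run
    }


info = describe_import("gensim")
print(info["executable"], info["found"], info["origin"])
```

If the two environments report different executables or search paths, the IDE is probably configured against a different interpreter than the one where the package was installed.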

Gensim equivalent of training steps

江枫思渺然 posted on 2019-12-11 07:01:35
Question: Does gensim Word2Vec have an option equivalent to "training steps" in the TensorFlow word2vec example here: Word2Vec Basic? If not, what default value does gensim use? Is the gensim parameter iter related to training steps? The TensorFlow script includes this section:

    with tf.Session(graph=graph) as session:
        # We must initialize all variables before we use them.
        init.run()
        print('Initialized')
        average_loss = 0
        for step in xrange(num_steps):
            batch_inputs, batch_labels = generate
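For orientation: in the TensorFlow script one step processes one mini-batch, whereas gensim's `iter` (renamed `epochs` in later releases, default 5) counts full passes over the corpus. The two can be related only roughly, because gensim does not train on fixed-size batches of pairs the way the TensorFlow script does. The arithmetic below is a sketch of that rough conversion; the function names and all numbers are made up for illustration.

```python
import math


def approx_steps(num_training_pairs, batch_size, epochs):
    """Mini-batch steps needed to see every (target, context) pair `epochs` times."""
    steps_per_epoch = math.ceil(num_training_pairs / batch_size)
    return steps_per_epoch * epochs


def approx_epochs(num_steps, batch_size, num_training_pairs):
    """Fractional passes over the data covered by `num_steps` mini-batches."""
    return num_steps * batch_size / num_training_pairs


# Illustrative values only: 1,000,000 training pairs, batches of 128.
print(approx_steps(1_000_000, 128, 5))
print(approx_epochs(100_000, 128, 1_000_000))
```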

Gensim Word2Vec uses too much memory

蹲街弑〆低调 posted on 2019-12-11 06:46:48
Question: I want to train a word2vec model on a 400 MB tokenized file. I have been trying to run this Python code:

    import operator
    import gensim, logging, os
    from gensim.models import Word2Vec
    from gensim.models import *

    class Sentences(object):
        def __init__(self, filename):
            self.filename = filename

        def __iter__(self):
            for line in open(self.filename):
                yield line.split()

    def runTraining(input_file,output_file):
        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging
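Because the `Sentences` class above already streams the file line by line, the input itself should not be what fills memory. The model's weight matrices scale with vocabulary size times vector dimensionality instead. The sketch below assumes single-precision floats and three vocabulary-sized matrices, the rule of thumb given in gensim's older Word2Vec tutorial. The matrix count and the `estimate_bytes` name are assumptions, and real usage also includes the vocabulary dictionary itself.

```python
def estimate_bytes(vocab_size, vector_size, matrices=3, bytes_per_float=4):
    """Rough size of the weight matrices alone (assumed count: 3 matrices)."""
    return vocab_size * vector_size * matrices * bytes_per_float


# Example figures: 100,000 unique words, 200-dimensional vectors.
size = estimate_bytes(100_000, 200)
print(f"{size / 1024**2:.0f} MiB")
```

Under this estimate, raising `min_count` (which shrinks the vocabulary) or lowering the vector size reduces memory roughly in proportion.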

Gensim Phrases usage to filter n-grams

久未见 posted on 2019-12-11 06:07:25
Question: I am using Gensim Phrases to identify important n-grams in my text as follows:

    bigram = Phrases(documents, min_count=5)
    trigram = Phrases(bigram[documents], min_count=5)
    for sent in documents:
        bigrams_ = bigram[sent]
        trigrams_ = trigram[bigram[sent]]

However, this detects uninteresting n-grams such as "special issue", "important matter", "high risk", etc. I am particularly interested in detecting concepts in the text such as "machine learning", "human computer interaction", etc. Is there a way to
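One simple post-processing option, sketched here as an assumption rather than as a Phrases feature, is to keep a joined phrase only when it appears in a hand-curated concept list and to split every other joined token back into its words. `keep_concept_phrases` and `concepts` are hypothetical names, and the code assumes the default `_` delimiter that Phrases uses to join tokens.

```python
def keep_concept_phrases(tokens, concepts, delimiter="_"):
    """Split any joined phrase that is not in `concepts` back into its words.

    Note: a token that contained the delimiter before phrase detection
    would also be split unless it is listed in `concepts`.
    """
    result = []
    for token in tokens:
        if delimiter in token and token not in concepts:
            result.extend(token.split(delimiter))
        else:
            result.append(token)
    return result


concepts = {"machine_learning", "human_computer_interaction"}
tokens = ["a", "special_issue", "on", "machine_learning"]
print(keep_concept_phrases(tokens, concepts))
```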

Document similarity in Spacy vs Word2Vec

荒凉一梦 posted on 2019-12-11 05:05:23
Question: I have a niche corpus of ~12k docs, and I want to find near-duplicate documents with similar meanings across it, e.g. articles about the same event covered by different news organisations. I have tried gensim's Word2Vec, which gives me terrible similarity scores (<0.3) even when the test document is within the corpus, and I have tried SpaCy, which gives me >5k documents with similarity > 0.9. I checked SpaCy's most similar documents, and they were mostly useless. This is the relevant code: tfidf =
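The snippet is cut off before the code, so the asker's exact method is unknown. As a reference point, a common baseline (an assumption here, not necessarily what the asker did) is to average the word vectors of each document and compare documents by cosine similarity. The sketch below uses plain Python lists and toy 2-dimensional vectors; `average_vector` and `cosine` are hypothetical helper names.

```python
import math


def average_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy word vectors for two tiny "documents".
doc_a = average_vector([[1.0, 0.0], [0.0, 1.0]])
doc_b = average_vector([[1.0, 0.0], [1.0, 0.0]])
print(cosine(doc_a, doc_b))
```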

Error with calling Numpy, Scipy, Gensim in python3

孤者浪人 posted on 2019-12-11 04:53:38
Question: Why do I get the following error when I import Numpy, Scipy, or Gensim with python3 on Linux?

    >import gensim
    _concrete_types = {v.type for k, v in _concrete_typeinfo.items()}
    AttributeError: 'tuple' object has no attribute 'type'

Answer 1: I observed this issue today as well, but with tooling that depends only on pandas and numpy. I've also seen a similar issue here: AttributeError: 'tuple' object has no attribute 'type' upon importing tensorflow. I can't add this as a comment because I don't have

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

此生再无相见时 posted on 2019-12-11 04:49:10
Question: I'm using Gensim with fastText word vectors to return similar words. This is my code:

    import gensim
    model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
    words = model.most_similar(positive=['sole'],topn=10)
    print(words)

This returns: [('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La',
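Most of the neighbours shown are the query word fused with punctuation and the start of the next sentence. A simple workaround, sketched here as an assumption rather than a gensim feature, is to post-filter the `(word, score)` pairs and keep only purely alphabetic tokens. `keep_clean_words` is a hypothetical helper, and the sample list reuses a few of the pairs printed above.

```python
def keep_clean_words(results, query):
    """Drop neighbours that are not purely alphabetic or merely repeat the query."""
    return [
        (word, score)
        for word, score in results
        if word.isalpha() and word.lower() != query.lower()
    ]


results = [
    ("sole.", 0.6860659122467041),
    ("sole.Ma", 0.6750558614730835),
    ("sole.Il", 0.6727924942970276),
    ("splende", 0.6336565613746643),
]
print(keep_clean_words(results, "sole"))
```

Since filtering discards entries, it helps to request a larger `topn` than is finally needed.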

Creating a wordvector model combining words from other models

女生的网名这么多〃 posted on 2019-12-10 22:13:47
Question: I have two different word-vector models created with the word2vec algorithm. The issue I am facing is that a few words from the first model are not in the second model. I want to create a third model from the two, so that I can use word vectors from both without losing the meaning and context of the word vectors. Can I do this, and if so, how?
Answer 1: You could potentially translate the vectors for the words that appear in only one model to the other model's coordinate space, using other

python luigi died unexpectedly with exit code -11

你离开我真会死。 posted on 2019-12-10 20:45:38
Question: I have a data pipeline with luigi that works perfectly fine if I assign 1 worker to the task. However, if I assign more than 1 worker, it dies (unexpectedly, with exit code -11) in a stage with 2 dependencies. The code is rather complex, so a minimal example would be difficult to give. The gist of the matter is that I am doing the following things with gensim:
1. Building a dictionary from some texts.
2. Building a corpus from said texts and the dictionary (requires (1)).
3. Training an LDA model from the
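As background for the exit code: when Python reports a negative exit code for a child process, the convention is that the process was terminated by the signal with that number, and signal 11 is SIGSEGV (a segmentation fault) on Linux. That usually points to a crash in native code rather than a Python exception. The sketch below decodes such codes; `describe_exit_code` is a hypothetical helper name.

```python
import signal


def describe_exit_code(exit_code):
    """Translate a child-process exit code into a readable description."""
    if exit_code < 0:
        try:
            # A negative code means "killed by signal number -exit_code".
            return signal.Signals(-exit_code).name
        except ValueError:
            return f"unknown signal {-exit_code}"
    return f"exit status {exit_code}"


print(describe_exit_code(-11))
```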