gensim

ELKI Kmeans clustering Task failed error for high dimensional data

只愿长相守 submitted on 2019-12-04 05:51:43
Question: I have 60,000 documents which I processed in gensim, producing a 60000*300 matrix. I exported this as a CSV file. When I import it into the ELKI environment and run KMeans clustering, I get the error below:

Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types:
  DBID
  DoubleVector,variable,mindim=266,maxdim=300
  LabelList
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation
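The "mindim=266,maxdim=300" in the error suggests that some CSV rows parse to fewer than 300 numbers (for example because of empty cells), so ELKI sees a variable-dimensional relation, while KMeans requires a fixed-dimensional vector field. A minimal sketch of one possible fix, padding every row to exactly 300 values before export (the function name and zero fill value are my own, not from the question):

```python
def pad_rows(rows, dim=300, fill=0.0):
    """Pad (or truncate) each CSV row to exactly `dim` numeric values,
    so ELKI parses a fixed-dimensional NumberVector field."""
    fixed = []
    for row in rows:
        vals = [float(v) for v in row]
        if len(vals) < dim:
            vals.extend([fill] * (dim - len(vals)))
        fixed.append(vals[:dim])
    return fixed
```

Whether zero-filling is appropriate depends on why the rows are short; finding and fixing the short rows during export may be the better option.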

Updating training documents for gensim Doc2Vec model

隐身守侯 submitted on 2019-12-04 05:41:59
Question: I have an existing gensim Doc2Vec model, and I'm trying to do iterative updates to the training set and, by extension, the model. I take the new documents and perform preprocessing as normal:

    stoplist = nltk.corpus.stopwords.words('english')
    train_corpus = []
    for i, document in enumerate(corpus_update['body'].values.tolist()):
        train_corpus.append(gensim.models.doc2vec.TaggedDocument(
            [word for word in gensim.utils.simple_preprocess(document) if word not in stoplist], [i]))

I then load the
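The preprocessing step above can be sketched in a self-contained way; here a tiny inline stoplist and a plain lowercase tokenizer stand in for nltk's stopword list and gensim.utils.simple_preprocess, and plain (words, tags) tuples stand in for TaggedDocument:

```python
STOPLIST = {"the", "a", "is", "in", "of"}  # stand-in for nltk's English stopwords

def preprocess(document):
    # stand-in for gensim.utils.simple_preprocess: lowercase, alphabetic tokens only
    return [w for w in document.lower().split() if w.isalpha() and w not in STOPLIST]

def tag_corpus(documents):
    # mirrors TaggedDocument(words, [i]) as (words, [i]) tuples
    return [(preprocess(d), [i]) for i, d in enumerate(documents)]
```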

What does “word for word” syntax mean in Python?

萝らか妹 submitted on 2019-12-04 05:12:21
Question: I see the following script snippet on the gensim tutorial page. What is the syntax of word for word in the Python script below?

    >>> texts = [[word for word in document.lower().split() if word not in stoplist]
    >>>          for document in documents]

Answer 1: This is a list comprehension. The code you posted loops through every element in document.lower().split() and creates a new list that contains only the elements that meet the if condition. It does this for each document in documents. Try it out... elems = [1
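A runnable version of the tutorial snippet, with example documents and a stoplist of my own choosing:

```python
documents = ["Human machine interface", "A survey of user opinion"]
stoplist = {"a", "of", "the"}

# outer comprehension: builds one inner list per document;
# inner comprehension: keeps only the words not in the stoplist
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
```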

Gensim Word2Vec select minor set of word vectors from pretrained model

痞子三分冷 submitted on 2019-12-04 03:56:15
Question: I have a large pretrained Word2Vec model in gensim, from which I want to use the pretrained word vectors for an embedding layer in my Keras model. The problem is that the embedding size is enormous and I don't need most of the word vectors (because I know which words can occur as input). So I want to get rid of them to reduce the size of my embedding layer. Is there a way to keep only the desired word vectors (including the corresponding indices!), based on a whitelist of words?

Answer 1: Thanks to this
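One way to sketch the idea, operating on a word-to-index map and a plain matrix rather than on gensim's internal structures (the function and variable names are my own):

```python
import numpy as np

def restrict_vectors(vocab, vectors, whitelist):
    """Keep only whitelisted words, re-indexing from zero.
    `vocab` maps word -> row index into `vectors` (the full embedding matrix)."""
    keep = [w for w in vocab if w in whitelist]
    new_vocab = {w: i for i, w in enumerate(keep)}
    new_vectors = np.stack([vectors[vocab[w]] for w in keep])
    return new_vocab, new_vectors
```

The reduced matrix can then back a much smaller embedding layer, with the new indices used to encode the input.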

Retrieve string version of document by ID in Gensim

对着背影说爱祢 submitted on 2019-12-04 02:39:29
I am using Gensim for some topic modelling, and I have gotten to the point where I am doing similarity queries using the LSI and tf-idf models. I get back a set of IDs and similarities, e.g. (299501, 0.64505910873413086). How do I get the text document related to the ID, in this case 299501? I have looked at the docs for corpus, dictionary, index, and the model and cannot seem to find it.

I have just gone through the same process and reached the same point of having "sims" with a document ID but wanting my original "article code". Although it's not provided entirely, there is a
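Gensim's corpus and index know documents only by their integer position, so the usual approach is to keep your own list of the original texts, built in the same order as the corpus; the similarity ID then doubles as a list index. A minimal sketch (the document texts here are invented):

```python
documents = ["first article text", "second article text"]

# build the tokenized corpus in the same order as `documents`
corpus = [doc.lower().split() for doc in documents]

def doc_for_id(doc_id):
    """Map a similarity-query ID, e.g. 299501, back to the original text."""
    return documents[doc_id]
```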

Using pretrained gensim Word2vec embedding in keras

筅森魡賤 submitted on 2019-12-04 01:26:55
Question: I have trained word2vec in gensim. In Keras, I want to use it to build a matrix of sentences using that word embedding. Since storing the matrix of all the sentences is very space- and memory-inefficient, I want to make an embedding layer in Keras to achieve this, so that it can be used in further layers (LSTM). Can you tell me in detail how to do this? PS: It is different from other questions because I am using gensim for word2vec training instead of Keras.

Answer 1: Let's say you have the following data that
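The core of most answers to this is building a weight matrix indexed the same way as the tokenized input. A sketch, assuming `vectors` is any word-to-vector mapping that raises KeyError on misses (gensim's KeyedVectors supports exactly this via `wv[word]`):

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    """Build a (vocab_size + 1, dim) matrix for a Keras Embedding layer.
    `word_index` maps word -> integer index (1-based; row 0 is reserved
    for padding). Out-of-vocabulary words stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        try:
            matrix[i] = vectors[word]
        except KeyError:
            pass
    return matrix
```

The matrix is then typically handed to the layer as initial weights with training disabled, e.g. Embedding(..., weights=[matrix], trainable=False) in many Keras versions.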

How to get vocabulary word count from gensim word2vec?

北城以北 submitted on 2019-12-04 00:59:36
Question: I am using the gensim word2vec package in Python. I know how to get the vocabulary from the trained model, but how do I get the word count for each word in the vocabulary?

Answer 1: Each word in the vocabulary has an associated vocabulary object, which contains an index and a count:

    vocab_obj = w2v.vocab["word"]
    vocab_obj.count

Output for the Google News w2v model: 2998437

So to get the count for each word, you would iterate over all words and vocab objects in the vocabulary: for word, vocab_obj in w2v.vocab
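The iteration the answer starts can be sketched with a stand-in for gensim's per-word vocab objects (a namedtuple here; note that in gensim 4.x the old `.vocab` attribute is gone and counts are read with `model.wv.get_vecattr(word, "count")` instead):

```python
from collections import namedtuple

# stand-in for the per-word vocab object in older gensim (pre-4.0)
VocabEntry = namedtuple("VocabEntry", ["index", "count"])
vocab = {"cat": VocabEntry(0, 10), "dog": VocabEntry(1, 7)}

# the pattern the answer describes: word -> count over the whole vocabulary
counts = {word: vocab_obj.count for word, vocab_obj in vocab.items()}
```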

Generator is not an iterator?

こ雲淡風輕ζ submitted on 2019-12-04 00:00:18
I have a generator (a function that yields stuff), but when trying to pass it to gensim.Word2Vec I get the following error:

TypeError: You can't pass a generator as the sentences argument. Try an iterator.

Isn't a generator a kind of iterator? If not, how do I make an iterator from it? Looking at the library code, it seems to simply iterate over the sentences, like for x in enumerate(sentences), which works just fine with my generator. What is causing the error then?

A generator is exhausted after one loop over it, whereas Word2Vec needs to traverse the sentences multiple times (and probably get item
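What Word2Vec actually wants is a restartable iterable: an object whose __iter__ returns a fresh iterator each time, so that multiple training passes all see the full data. A minimal sketch:

```python
class Sentences:
    """Restartable iterable: each `for` loop calls __iter__ and gets a
    fresh generator, unlike a bare generator, which is exhausted after
    the first pass."""
    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        for text in self.texts:
            yield text.lower().split()
```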

Gensim word2vec in python3 missing vocab

寵の児 submitted on 2019-12-03 23:26:13
I'm using the gensim implementation of Word2Vec. I have the following code snippet:

    print('training model')
    model = Word2Vec(Sentences(start, end))
    print('trained model:', model)
    print('vocab:', model.vocab.keys())

When I run this in Python 2, it runs as expected: the final print shows all the words in the vocabulary. However, if I run it in Python 3, I get an error:

    trained model: Word2Vec(vocab=102, size=100, alpha=0.025)
    Traceback (most recent call last):
      File "learn.py", line 58, in <module>
        train(to_datetime('-4h'), to_datetime('now'), 'model.out')
      File "learn.py", line 23, in train
        print('vocab:'

How to turn embeddings loaded in a Pandas DataFrame into a Gensim model?

自作多情 submitted on 2019-12-03 21:28:48
I have a DataFrame whose index is words, with 100 float columns, so that for each word I have its embedding as a 100-dimensional vector. I would like to convert my DataFrame into a gensim model object so that I can use its methods, especially gensim.models.keyedvectors.most_similar(), to search for similar words within my subset. What is the preferred way of doing that? Thanks

Not sure what the "preferred" way of doing this is, but the format gensim expects is pretty easy to replicate:

    data = pd.DataFrame([[0.15941701, 0.84058299], [0.12190033, 0.87809967], [0
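One concrete reading of "the format gensim expects" is the word2vec text format, which KeyedVectors.load_word2vec_format(path, binary=False) can read back. A sketch serializing a words-as-index DataFrame into it (the helper name is my own):

```python
import pandas as pd

def to_word2vec_text(df):
    """Serialize a DataFrame (index = words, columns = vector components)
    into the word2vec text format: a '<vocab_size> <dim>' header line,
    then one 'word v1 v2 ...' line per word."""
    lines = [f"{len(df)} {df.shape[1]}"]
    for word, row in df.iterrows():
        lines.append(word + " " + " ".join(f"{v:.6f}" for v in row))
    return "\n".join(lines)
```

Writing the result to a file and loading it with gensim.models.KeyedVectors.load_word2vec_format(path, binary=False) yields an object with most_similar() and the other KeyedVectors methods.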