gensim

【NLP】【Part 5】Word2Vec in gensim

倖福魔咒の submitted on 2019-12-06 06:40:59
【1】Workflow overview: gensim wraps Google's original C implementation of Word2Vec and exposes it through a convenient Python API. The overall workflow is:
1. Data preprocessing (producing segmented text)
2. Data loading
3. Model definition and training
4. Model saving and loading
5. Model usage (similarity queries, retrieving word vectors)
【2】Main word2vec features provided by gensim
【3】Examples of using the gensim API
1. Segment the text with jieba. Text data: the full text of the novel In the Name of the People (《人民的名义》) as the corpus. Baidu cloud drive: https://pan.baidu.com/s/1ggA4QwN

# -*- coding:utf-8 -*-
import jieba

def preprocess_in_the_name_of_people():
    # read the raw novel, segment it with jieba, and write the
    # space-separated result back out as UTF-8
    with open("in_the_name_of_people.txt", mode='rb') as f:
        doc = f.read()
    doc_cut = jieba.cut(doc)
    result = ' '.join(doc_cut)
    result = result.encode('utf-8')
    with open("in_the_name_of_people_cut.txt", mode='wb') as f2:
        f2.write(result)

2. Train word vectors on the original text8.zip data: from
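The excerpt cuts off before the training step. Below is a minimal sketch of steps 2-5, assuming the segmented file produced above and an older (pre-4.0) gensim, where the dimensionality parameter is still called size (it is vector_size in gensim 4+); all parameter values are illustrative:

from gensim.models import word2vec

# step 2: stream the space-separated corpus one line at a time
sentences = word2vec.LineSentence("in_the_name_of_people_cut.txt")

# step 3: define and train the model
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5)

# step 4: save and reload
model.save("people.model")
model = word2vec.Word2Vec.load("people.model")

# step 5: similarity queries and raw word vectors
print(model.wv.most_similar(u"人民"))  # most similar words
vector = model.wv[u"人民"]             # the 100-dimensional vector itself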

Doc2vec: model.docvecs is only of length 10

只谈情不闲聊 submitted on 2019-12-06 06:02:53
I am trying doc2vec for 600,000 rows of sentences and my code is below:

model = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, window=4, iter=50, workers=cores)
model.build_vocab(res)
model.train(res, total_examples=model.corpus_count, epochs=model.iter)

# len(res) = 663406
# number of unique words: 15581
print(len(model.wv.vocab))

# length of doc vectors is 10
len(model.docvecs)

# each of length 100
len(model.docvecs[1])

How do I interpret this result? Why are there only 10 document vectors, each of size 100? When the length of 'res' is 663,406, that does not make sense. I know something
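The usual cause, and very likely the answer here: each TaggedDocument's tags must be a list. If a plain string such as str(i) is passed instead, gensim iterates over it character by character, so the only tags the model ever sees are the ten digit characters '0' through '9', giving exactly len(model.docvecs) == 10. A sketch of the fix, assuming res was built from a list of token lists called texts:

from gensim.models.doc2vec import TaggedDocument

# WRONG: tags=str(i) is iterated per character -> only tags '0'..'9'
# res = [TaggedDocument(words=tokens, tags=str(i)) for i, tokens in enumerate(texts)]

# RIGHT: wrap the tag in a list so every document gets its own tag
res = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(texts)]

After retraining, len(model.docvecs) should equal len(res).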

Cosine Similarity and LDA topics

大憨熊 submitted on 2019-12-06 04:24:29
I want to compute cosine similarity between LDA topics. The gensim function matutils.cossim can do it, but I don't know which parameters (vectors) to pass to it. Here is a snippet of the code:

import numpy as np
import lda
from sklearn.feature_extraction.text import CountVectorizer

cvectorizer = CountVectorizer(min_df=4, max_features=10000, stop_words='english')
cvz = cvectorizer.fit_transform(tweet_texts_processed)

n_topics = 8
n_iter = 500
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)

n_top_words = 6
topic_summaries = []
topic_word =
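matutils.cossim expects gensim-style sparse vectors, i.e. lists of (id, weight) pairs. Since the lda package exposes topics as dense rows of lda_model.topic_word_, each row can be converted with matutils.full2sparse first; a sketch, assuming lda_model has been fitted as in the snippet above:

from gensim import matutils

# each row of topic_word_ is a dense distribution over the vocabulary
topic_word = lda_model.topic_word_  # shape: (n_topics, vocab_size)

# convert dense rows to gensim's sparse [(term_id, weight), ...] format
sparse_topics = [matutils.full2sparse(row) for row in topic_word]

# cosine similarity between topic 0 and topic 1
print(matutils.cossim(sparse_topics[0], sparse_topics[1]))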

How to convert gensim Word2Vec model to FastText model?

[亡魂溺海] submitted on 2019-12-06 02:57:43
I have a Word2Vec model which was trained on a huge corpus. While using this model for a neural network application I came across quite a few "out of vocabulary" words, and I now need word embeddings for them. Some googling turned up FastText, a library Facebook recently released for exactly this. My question is: how can I convert my existing Word2Vec model or KeyedVectors to a FastText model? FastText is able to create vectors for subword fragments by including those fragments in the initial training on the original corpus. Then, when encountering an
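There is no direct conversion: a trained Word2Vec model contains no subword n-gram weights, and those are exactly what FastText uses to synthesize vectors for out-of-vocabulary words. The practical route is to retrain on the original corpus with gensim's own FastText class; a sketch, where sentences stands for your tokenized corpus and size is called vector_size in gensim 4+:

from gensim.models import FastText

# retrain from scratch; the character n-grams (min_n..max_n) learned here
# are what later allow vectors for unseen words
model = FastText(sentences, size=100, window=5, min_count=5, min_n=3, max_n=6)

# works even for a word never seen during training, as long as
# some of its character n-grams were seen
vec = model.wv["unseenword"]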

Training word2vec in TensorFlow, importing to Gensim

自闭症网瘾萝莉.ら submitted on 2019-12-06 02:35:41
I am training a word2vec model from the TensorFlow tutorial: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py After training I get the embedding matrix. I would like to save this and import it as a trained model in gensim. To load a model in gensim, the command is:

model = Word2Vec.load_word2vec_format(fn, binary=True)

But how do I generate the fn file from TensorFlow? Thanks. One way is to save the file in the non-binary word2vec format, which essentially looks like this:

num_words vector_size  # this is the header
label0 x00 x01 ...
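A sketch of writing the TensorFlow embedding matrix in that non-binary format and loading it back; final_embeddings (a vocab_size x dim numpy array) and reverse_dictionary (mapping row index to word) are the names used in word2vec_basic.py:

from gensim.models import KeyedVectors

def save_word2vec_text(fn, embeddings, reverse_dictionary):
    vocab_size, dim = embeddings.shape
    with open(fn, "w", encoding="utf-8") as f:
        f.write("%d %d\n" % (vocab_size, dim))  # the header line
        for i in range(vocab_size):
            vec = " ".join("%f" % x for x in embeddings[i])
            f.write("%s %s\n" % (reverse_dictionary[i], vec))

save_word2vec_text("tf_embeddings.txt", final_embeddings, reverse_dictionary)

# note binary=False for the text format; in newer gensim the loader
# lives on KeyedVectors rather than on Word2Vec itself
wv = KeyedVectors.load_word2vec_format("tf_embeddings.txt", binary=False)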

How to handle words that are not in word2vec's vocab optimally

大憨熊 submitted on 2019-12-06 02:16:38
I have a list of ~10 million sentences, each of which contains up to 70 words. I'm running gensim word2vec on every word and then taking the simple average of each sentence. The problem is that I use min_count=1000, so a lot of words are not in the vocab. To solve that, I intersect the vocab array (which contains about 10,000 words) with every sentence, and if at least one element is left in that intersection, I return its simple average; otherwise, I return a vector of zeros. The issue is that calculating every average takes a very long time when I run it on the whole dataset
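A sketch of a faster per-sentence average that skips the per-sentence intersection entirely: an O(1) set membership test per word plus a single np.mean call, which does the averaging in C. model.wv.vocab is a dict in gensim 3.x (it became model.wv.key_to_index in 4+):

import numpy as np

vocab = set(model.wv.vocab)  # build the lookup set once, outside the loop
dim = model.vector_size

def sentence_vector(words):
    # collect vectors for in-vocab words in one pass
    vecs = [model.wv[w] for w in words if w in vocab]
    if not vecs:
        return np.zeros(dim)  # no known words: fall back to zeros
    return np.mean(vecs, axis=0)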

Python: What is the “size” parameter in Gensim Word2vec model class

穿精又带淫゛_ submitted on 2019-12-06 02:15:26
Question: I have been struggling to understand the use of the size parameter in gensim.models.Word2Vec. From the gensim documentation, size is the dimensionality of the vector. Now, as far as I understand, word2vec creates for each word a vector of the probability of closeness with the other words in the sentence. So, if my vocab size is 30, how can it create a vector with a dimension greater than 30? Can anyone please brief me on the optimal value of the Word2Vec size? Thank you.

Answer 1:
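The short answer: size has nothing to do with the vocabulary. It is the number of dimensions of the dense embedding space the words are projected into, not a probability vector over other words, so a 30-word vocab can happily live in a 100-dimensional space. A tiny sketch illustrating the independence (size is vector_size in gensim 4+):

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = Word2Vec(sentences, size=100, min_count=1)

print(len(model.wv.vocab))    # 4 distinct words in the vocab
print(model.wv["cat"].shape)  # (100,) -- set by size, not by vocab size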

What are doc2vec training iterations?

為{幸葍}努か submitted on 2019-12-06 02:14:49
I am new to doc2vec and was initially trying to understand it; below is my code, which uses gensim. As intended, I get a trained model and document vectors for the two documents. However, I would like to know the benefits of retraining the model over several epochs and how to do that in gensim. Can we do it using the iter or alpha parameter, or do we have to train it in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs. I am also interested in knowing whether multiple training iterations are needed for the word2vec model as well.
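In gensim the idiomatic way is to set the number of passes once and let a single train() call handle all of them, including the learning-rate (alpha) decay, rather than writing your own for loop with manually decayed rates. A sketch, with documents standing for the TaggedDocument list (the parameter was called iter before gensim 4; epochs works in 3.x and later):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
model.build_vocab(documents)
# one call, 20 internal passes with properly decaying alpha
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

The same applies to Word2Vec: multiple passes are normal there too, and its epochs/iter parameter controls them the same way.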

Using freebase vectors with gensim

让人想犯罪 __ submitted on 2019-12-05 21:31:32
I am trying to use the freebase word embeddings released by Google, but I have a hard time getting the words from the freebase names.

model = gensim.models.Word2Vec.load_word2vec_format('freebase-vectors-skipgram1000.bin', binary=True)
model.vocab.keys()[:10]
Out[22]:
[u'/m/026tg5z', u'/m/018jz8', u'/m/04klsk', u'/m/08gd39', u'/m/0kt94', u'/m/05mtf0t', u'/m/05tjjb', u'/m/01m3vn', u'/m/0h7p35', u'/m/03ggvg3']

Does anyone know whether some kind of table exists to map the freebase representations to the words they represent? Regards, Hedi. Someone has actually done a nice thing for us all and mapped
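The keys really are Freebase machine ids (mids), so the model itself carries no human-readable names; an external mid-to-name table is needed, which can be extracted from the Freebase data dumps (the type.object.name triples). A sketch of applying such a table; the file name and its one-pair-per-line TSV format are purely hypothetical:

# hypothetical mapping file: one "mid<TAB>name" pair per line
mid2name = {}
with open("mid_to_name.tsv", encoding="utf-8") as f:
    for line in f:
        mid, name = line.rstrip("\n").split("\t")
        mid2name[mid] = name

print(mid2name.get(u'/m/026tg5z', 'unknown'))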

Retrieve string version of document by ID in Gensim

风流意气都作罢 submitted on 2019-12-05 19:37:50
Question: I am using Gensim for some topic modelling, and I have gotten to the point where I am doing similarity queries using the LSI and tf-idf models. I get back a set of IDs and similarities, e.g. (299501, 0.64505910873413086). How do I get the text document that is related to the ID, in this case 299501? I have looked at the docs for the corpus, dictionary, index, and model and cannot seem to find it.

Answer 1: I have just gone through the same process and reached the same point of having "sims" with
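The usual answer: gensim's corpus, dictionary, and index hold only bag-of-words vectors, never the original strings, so the ID is simply the position of the document in the sequence the corpus was built from, and you keep your own parallel list. A sketch, with documents and query_vec standing in for your own data:

# keep the original texts in the same order the corpus was built in
documents = ["first document text ...", "second document text ..."]

sims = index[query_vec]   # e.g. [(299501, 0.64505910873413086), ...]
doc_id, score = sims[0]
print(documents[doc_id])  # the text behind that ID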