gensim

Training word vectors with gensim word2vec in Python

坚强是说给别人听的谎言 submitted on 2019-12-04 21:04:09
Preparation: once Anaconda is installed, gensim can be installed from the command window with the command conda install gensim.

About gensim: gensim is a powerful natural language processing toolkit that ships with a great many common models. A taste of its modules:

interfaces – Core gensim interfaces
utils – Various utility functions
matutils – Math utils
corpora.bleicorpus – Corpus in Blei's LDA-C format
corpora.dictionary – Construct word<->id mappings
corpora.hashdictionary – Construct word<->id mappings
corpora.lowcorpus – Corpus in List-of-Words format
corpora.mmcorpus – Corpus in Matrix Market format
corpora.svmlightcorpus – Corpus in SVMlight format
corpora.wikicorpus – Corpus from a Wikipedia dump
corpora.textcorpus – Building
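As a quick taste of the corpora modules listed above, here is a minimal sketch (the two toy documents are made up for illustration) of building a word<->id mapping with corpora.dictionary:

from gensim import corpora

# Two toy, already-tokenized "documents".
texts = [["human", "computer", "interaction"],
         ["computer", "survey", "interaction"]]

# Construct the word<->id mapping.
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)  # e.g. {'computer': 0, 'human': 1, ...}

# Convert a new document to its bag-of-words representation.
bow = dictionary.doc2bow(["human", "computer", "computer"])
print(bow)  # [(computer_id, 2), (human_id, 1)]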

Gensim word2vec on predefined dictionary and word-indices data

ぐ巨炮叔叔 submitted on 2019-12-04 18:26:04
Question: I need to train a word2vec representation on tweets using gensim. Unlike most tutorials and code I've seen for gensim, my data is not raw but has already been preprocessed. I have a dictionary in a text document containing 65k words (including an "unknown" token and an EOL token), and the tweets are saved as a numpy matrix of indices into this dictionary. A simple example of the data format can be seen below:

dict.txt
you
love
this
code

tweets (5 is unknown and 6 is EOL)
[[0, 1, 2, 3, 6], [3, 5, 5
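One workable approach (a sketch, not from the original post; it assumes the real data maps cleanly back to tokens) is to translate the index matrix back into lists of words, since gensim's Word2Vec consumes tokenized sentences:

from gensim.models import Word2Vec

# Toy vocabulary standing in for the 65k-word dict.txt; index 4 is a
# made-up filler so that 5 and 6 line up with the "unknown" and EOL
# tokens from the question.
id2word = ["you", "love", "this", "code", "<filler>", "<unk>", "<eol>"]

# Toy index matrix standing in for the real numpy data.
tweets = [[0, 1, 2, 3, 6], [3, 5, 5, 6]]

# Map indices back to tokens, dropping the EOL marker.
sentences = [[id2word[i] for i in tweet if i != 6] for tweet in tweets]

model = Word2Vec(sentences, size=50, min_count=1)
print(model.wv['love'].shape)  # (50,)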

Loss does not decrease during training (Word2Vec, Gensim)

心已入冬 submitted on 2019-12-04 18:24:36
What can cause the loss from model.get_latest_training_loss() to increase on each epoch? Code used for training:

import os
from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch and show training parameters'''

    def __init__(self, savedir):
        self.savedir = savedir
        self.epoch = 0
        os.makedirs(self.savedir, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = os.path.join(self.savedir, "model_neg{}_epoch.gz".format(self.epoch))
        model.save(savepath)
        print(
            "Epoch saved: {}".format(self.epoch + 1),
            "Start next epoch ... ",
            sep="\n"
        )
        if os.path.isfile(os.path.join(self.savedir, "model_neg{
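A likely explanation (not part of the excerpt, but a known quirk of gensim's loss tracking): get_latest_training_loss() reports a running total accumulated across epochs, not a per-epoch value, so the raw number grows even when training is healthy. A minimal sketch of a callback that prints the per-epoch delta instead, assuming training was started with compute_loss=True:

from gensim.models.callbacks import CallbackAny2Vec

class EpochLossLogger(CallbackAny2Vec):
    '''Prints per-epoch loss as the difference of the cumulative total.'''

    def __init__(self):
        self.epoch = 0
        self.previous_total = 0.0

    def on_epoch_end(self, model):
        total = model.get_latest_training_loss()  # cumulative across epochs
        print("Epoch {}: loss = {}".format(self.epoch, total - self.previous_total))
        self.previous_total = total
        self.epoch += 1

# Usage (compute_loss=True is required, otherwise the reported loss stays 0):
# model = Word2Vec(sentences, compute_loss=True, callbacks=[EpochLossLogger()])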

[NLP] Using LDA for topic analysis of the Hillary Clinton emails

我与影子孤独终老i submitted on 2019-12-04 18:00:33
First, read the data set and drop the rows whose ExtractedBodyText field in the CSV is empty:

import pandas as pd
import re
import os

dir_path = os.path.dirname(os.path.abspath(__file__))
data_path = dir_path + "/Database/HillaryEmails.csv"
df = pd.read_csv(data_path)
df = df[['Id', 'ExtractedBodyText']].dropna()

Not every word in these emails carries meaning, so the noisy data has to be cleaned out first:

def clean_email_text(text):
    text = text.replace('\n', " ")                      # newlines are not needed
    text = re.sub(r"-", " ", text)                      # split hyphenated words (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)             # dates mean nothing to a topic model
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times, meaningless
    text = re.sub(r"[\w]+@[\.\w]+", "", text)           # email addresses, meaningless
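The excerpt stops inside clean_email_text; as a sketch of how such a pipeline typically continues (the stop-word list and parameter values below are illustrative, not taken from the original article), the cleaned texts are tokenized, turned into a bag-of-words corpus, and handed to gensim's LdaModel:

from gensim import corpora, models

# Assumes the (truncated) clean_email_text ends by returning the cleaned string.
docs = [clean_email_text(text) for text in df['ExtractedBodyText']]

stoplist = {'the', 'to', 'and', 'of', 'a', 'in', 'for', 'on'}  # illustrative only
texts = [[w for w in doc.lower().split() if w not in stoplist] for doc in docs]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=20)
print(lda.print_topics(num_topics=5, num_words=5))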

How does gensim calculate doc2vec paragraph vectors

痞子三分冷 submitted on 2019-12-04 17:44:26
Question: I am going through this paper http://cs.stanford.edu/~quocle/paragraph_vector.pdf, and it states that "The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. In the experiments, we use concatenation as the method to combine the vectors." How does concatenation or averaging work? Example (if paragraph 1 contains word1 and word2):

word1 vector = [0.1, 0.2, 0.3]
word2 vector = [0.4, 0.5, 0.6]

concat method does paragraph vector = [0.1+0.4, 0.2+0.5, 0.3+0.6]
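For the record (this clarification is not in the excerpt): the element-wise addition written above is what averaging does before dividing by the count; concatenation instead stacks the vectors end to end, doubling the dimensionality. A quick sketch:

import numpy as np

word1 = np.array([0.1, 0.2, 0.3])
word2 = np.array([0.4, 0.5, 0.6])

# Averaging keeps the dimensionality.
avg = (word1 + word2) / 2             # [0.25, 0.35, 0.45]

# Concatenation stacks the vectors, growing the dimensionality.
cat = np.concatenate([word1, word2])  # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]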

Load gensim Word2Vec computed in Python 2, in Python 3

。_饼干妹妹 submitted on 2019-12-04 16:05:45
I have a gensim Word2Vec model computed in Python 2 like this:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(LineSentence('enwiki.txt'), size=100, window=5, min_count=5, workers=15)
model.save('w2v.model')

However, I need to use it in Python 3. If I try to load it,

import gensim
from gensim.models import Word2Vec

model = Word2Vec.load('w2v.model')

it results in an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf9 in position 0: ordinal not in range(128)

I suppose the problem is in the differences in encoding between Python 2 and Python 3.
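One commonly suggested workaround (a sketch, not taken from the excerpt; it preserves only the word vectors, not the full trainable model) is to export to the portable word2vec format under Python 2 and reload it under Python 3:

# Under Python 2: export just the vectors in the portable binary format.
from gensim.models import Word2Vec
model = Word2Vec.load('w2v.model')
model.wv.save_word2vec_format('w2v.bin', binary=True)

# Under Python 3: reload them as read-only KeyedVectors.
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format('w2v.bin', binary=True)
print(wv.most_similar('king'))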

How can I access output embedding(output vector) in gensim word2vec?

北城以北 submitted on 2019-12-04 13:49:35
Question: I want to use the output embeddings of word2vec, as in this paper (Improving Document Ranking with Dual Word Embeddings). I know the input vectors are in syn0, and the output vectors are in syn1, or syn1neg with negative sampling. But when I calculated most_similar with an output vector, I got the same result in some ranges, because syn1 or syn1neg gets removed. Here is what I got:

IN[1]: model = Word2Vec.load('test_model.model')
IN[2]: model.most_similar([model.syn1neg[0]])
OUT[2]: [('of', -0.04402521997690201), (
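A sketch of one way around this (not from the excerpt; it assumes a negative-sampling model of that gensim era whose output vectors are still present in syn1neg): most_similar always ranks candidates against the input vectors, so the IN-OUT similarity of the dual-embedding paper has to be computed by hand:

import numpy as np

# 'model' is the Word2Vec model loaded above.
in_vec = model.wv.syn0[model.wv.vocab['king'].index]  # IN vector of 'king'

# Cosine similarity of one IN vector against every OUT vector.
out = model.syn1neg
sims = out.dot(in_vec) / (np.linalg.norm(out, axis=1) * np.linalg.norm(in_vec))

top = np.argsort(-sims)[:10]
print([(model.wv.index2word[i], float(sims[i])) for i in top])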

Error when implementing gensim.LdaMallet

老子叫甜甜 submitted on 2019-12-04 09:38:55
I was following the instructions at this link (http://radimrehurek.com/2014/03/tutorial-on-mallet-in-python/), but I ran into an error when I tried to train the model:

model = models.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)

IOError: [Errno 2] No such file or directory: 'c:\\users\\brlu\\appdata\\local\\temp\\c6a13a_state.mallet.gz'

Please share any thoughts you might have. Thanks.

This can happen for two reasons:

1. You have a space in your mallet path.
2. There is no MALLET_HOME environment variable.

Make sure that mallet works properly from the command line.
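A minimal sketch of both fixes (the install paths below are placeholders for wherever MALLET actually lives):

import os
from gensim import models

# Fix 2: point MALLET_HOME at the MALLET installation directory.
os.environ['MALLET_HOME'] = 'C:\\mallet-2.0.8'

# Fix 1: keep the path free of spaces (e.g. avoid 'C:\\Program Files\\...').
mallet_path = 'C:\\mallet-2.0.8\\bin\\mallet'

# corpus and its dictionary are assumed to be built as in the tutorial.
model = models.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)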

How to get a complete topic distribution for a document using gensim LDA?

匆匆过客 submitted on 2019-12-04 09:38:14
Question: When I train my LDA model like this,

dictionary = corpora.Dictionary(data)
corpus = [dictionary.doc2bow(doc) for doc in data]
num_cores = multiprocessing.cpu_count()
num_topics = 50
lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1)

I want to get a full topic distribution over all num_topics for each and every document. That is, in this particular case, I want each document to have 50 topics contributing to the distribution, and I want to
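The excerpt stops mid-sentence, but a known way to obtain the complete distribution (a sketch; by default gensim drops topics whose probability falls below the minimum_probability threshold) is to request every topic explicitly:

# Per document: ask for every topic, however small its probability.
all_topics = lda.get_document_topics(corpus[0], minimum_probability=0)
print(len(all_topics))  # 50 (topic_id, probability) pairs

# Or set the threshold model-wide at construction time:
# lda = LdaMulticore(corpus, num_topics=50, id2word=dictionary, minimum_probability=0)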

Python: What is the “size” parameter in Gensim Word2vec model class

和自甴很熟 submitted on 2019-12-04 08:45:25
I have been struggling to understand the use of the size parameter in gensim.models.Word2Vec. From the gensim documentation, size is the dimensionality of the vector. Now, as far as my knowledge goes, word2vec creates for each word a vector of the probability of closeness with the other words in the sentence. So, if my vocabulary size is 30, how does it create a vector with a dimension greater than 30? Can anyone please brief me on the optimal value of the Word2Vec size? Thank you.

size is, as you note, the dimensionality of the vector. Word2Vec needs large, varied text examples to create
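To make the distinction concrete (a toy sketch, not part of the answer; the corpus and numbers are made up), size fixes the length of each learned dense vector and is independent of the vocabulary size:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]  # toy corpus

# The vocabulary has only 5 distinct words, yet each embedding
# has 100 dimensions, because size sets the vector length.
model = Word2Vec(sentences, size=100, min_count=1)

print(len(model.wv.vocab))    # 5
print(model.wv['cat'].shape)  # (100,)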