gensim

Semantic Similarity across multiple languages

Submitted by 感情迁移 on 2020-01-05 05:36:06
Question: I am using word embeddings to find the similarity between two sentences. Using word2vec, I also get a similarity measure when one sentence is in English and the other in Dutch (though not a very good one). So I started wondering whether it is possible to compute the similarity between two sentences in two different languages (without an explicit translation), especially when the languages are related (English/Dutch)? Answer 1: Let's assume that your sentence-similarity scheme uses only word-vectors
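The answer's word-vector assumption can be made concrete with a tiny sketch: average each sentence's word vectors and compare the averages with cosine similarity. The vocabulary and vector values below are invented for illustration; meaningful English/Dutch scores would require embeddings trained in, or aligned to, a shared vector space.

```python
import math

# Toy word vectors; real cross-lingual use needs embeddings aligned to a
# shared space -- these values are illustrative only.
vectors = {
    "cat":  [0.90, 0.10, 0.00],
    "kat":  [0.85, 0.15, 0.05],   # Dutch "cat", assumed close in a shared space
    "sits": [0.10, 0.80, 0.20],
    "zit":  [0.12, 0.75, 0.25],
}

def sentence_vector(tokens, vectors):
    """Average the vectors of known tokens (a common, simple sentence embedding)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

en = sentence_vector(["cat", "sits"], vectors)
nl = sentence_vector(["kat", "zit"], vectors)
print(round(cosine(en, nl), 3))
```

With real embeddings, the same averaging-plus-cosine scheme only gives cross-lingual scores as good as the alignment between the two languages' vector spaces.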

Gensim mallet CalledProcessError: returned non-zero exit status

Submitted by 前提是你 on 2020-01-04 01:56:08
Question: I'm getting an error while trying to access gensim's MALLET wrapper from a Jupyter notebook. I have the specified file 'mallet' in the same folder as my notebook, but I can't seem to access it. I tried routing to it from the C drive, but I still get the same error. Please help :)

import os
from gensim.models.wrappers import LdaMallet
#os.environ.update({'MALLET_HOME':r'C:/Users/new_mallet/mallet-2.0.8/'})
mallet_path = 'mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path,
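The LdaMallet wrapper shells out to the MALLET launcher, so mallet_path must resolve to the actual executable; a bare 'mallet' resolves against the notebook kernel's current working directory, which is often not the notebook's folder. A minimal sketch of building an absolute path instead (the mallet-2.0.8 location under the home directory is an assumption; adjust it to wherever MALLET was unpacked):

```python
import os

# A relative path like 'mallet' resolves against the *current working
# directory* of the kernel, not the notebook's folder.
relative = "mallet"
print(os.path.abspath(relative))  # where the wrapper would actually look

# Safer: build an absolute path to the MALLET launcher. The location below
# is an assumption -- point it at the real unpacked distribution.
mallet_home = os.path.expanduser("~/mallet-2.0.8")
mallet_path = os.path.join(mallet_home, "bin", "mallet")
os.environ["MALLET_HOME"] = mallet_home  # the wrapper reads this as well

# Then (gensim <= 3.8, where the wrapper still exists):
# from gensim.models.wrappers import LdaMallet
# lda = LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=dictionary)
print(os.path.isabs(mallet_path))
```

On Windows the launcher is bin\mallet.bat, so the join would use that filename instead.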

How to map the word in data frame to integer ID with python-pandas and gensim?

Submitted by 拈花ヽ惹草 on 2020-01-03 17:08:21
Question: Given a data frame containing items and their review texts:

item_id review_text
B2JLCNJF16 i was attracted to this...
B0009VEM4U great snippers...

I want to map the top 5000 most frequent words in review_text, so the resulting data frame should look like:

item_id review_text
B2JLCNJF16 1 2 3 4 5...
B0009VEM4U 6... # as the word "snippers" is not among the top 5000 most frequent words

Or, better still, a bag-of-words vector:

item_id review_text
B2JLCNJF16 [1,1,1,1,1....]
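Gensim's Dictionary (with filter_extremes) is the usual tool here, but the core idea fits in plain Python: count word frequencies across the corpus, keep the top N, assign integer ids, and encode each review. The toy corpus and TOP_N = 5 below stand in for the real data frame and the 5000-word cutoff:

```python
from collections import Counter

reviews = {
    "B2JLCNJF16": "i was attracted to this",
    "B0009VEM4U": "great snippers",
}

TOP_N = 5  # the question uses 5000; a tiny value keeps the toy example readable

# 1. Count word frequencies over the whole corpus.
counts = Counter(word for text in reviews.values() for word in text.split())

# 2. Keep only the TOP_N most frequent words and give each an integer id.
vocab = {word: i for i, (word, _) in enumerate(counts.most_common(TOP_N), start=1)}

# 3. Replace each word with its id, dropping out-of-vocabulary words.
encoded = {
    item: [vocab[w] for w in text.split() if w in vocab]
    for item, text in reviews.items()
}
print(encoded)
```

A bag-of-words vector per item is then just a count over vocabulary ids, e.g. [ids.count(i) for i in range(1, TOP_N + 1)] for each encoded list; with pandas, the encoding step maps naturally onto df['review_text'].apply(...).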

C extension not loaded for Word2Vec

Submitted by 我是研究僧i on 2020-01-03 09:04:24
Question: I reinstalled the gensim package and Cython, but it continuously shows this warning. Does anybody know about this? I am using Python 3.6 and PyCharm on Linux Mint.

UserWarning: C extension not loaded for Word2Vec, training will be slow. Install a C compiler and reinstall gensim for fast training.
warnings.warn("C extension not loaded for Word2Vec, training will be slow. "

It also shows this line when I create or load a model:

Slow version of gensim.models.doc2vec is being used

Answer 1: There is some problem

Training wordvec in Tensorflow, importing to Gensim

Submitted by 生来就可爱ヽ(ⅴ<●) on 2020-01-02 09:24:24
Question: I am training a word2vec model from the TensorFlow tutorial: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py After training I get the embedding matrix. I would like to save this and import it as a trained model in gensim. To load a model in gensim, the command is:

model = Word2Vec.load_word2vec_format(fn, binary=True)

But how do I generate the fn file from TensorFlow? Thanks Answer 1: One way is to save the file in the non-binary
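The non-binary format the answer refers to is simple enough to write by hand: a header line with the vocabulary size and vector dimensionality, then one word per line followed by its space-separated components. A sketch of dumping an embedding matrix that way (save_word2vec_text, vocab, and embeddings are hypothetical names standing in for the tutorial's reverse_dictionary and final_embeddings):

```python
# Dump an embedding matrix (rows of floats) plus its vocabulary in the
# plain-text word2vec format that gensim can read back.

def save_word2vec_text(path, vocab, embeddings):
    """vocab: list of words; embeddings: list of equal-length float rows."""
    dim = len(embeddings[0])
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(vocab)} {dim}\n")  # header: vocab_size vector_size
        for word, row in zip(vocab, embeddings):
            f.write(word + " " + " ".join(f"{x:.6f}" for x in row) + "\n")

# Toy stand-ins for the tutorial's vocabulary and trained matrix.
vocab = ["the", "cat"]
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
save_word2vec_text("tf_embeddings.txt", vocab, embeddings)
```

The file can then be loaded with KeyedVectors.load_word2vec_format("tf_embeddings.txt", binary=False) in current gensim (Word2Vec.load_word2vec_format in the older versions the question quotes).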

Understanding LDA Transformed Corpus in Gensim

Submitted by *爱你&永不变心* on 2020-01-02 06:53:12
Question: I tried to examine the contents of the BOW corpus vs. the LDA[BOW corpus] (transformed by an LDA model trained on that corpus with, say, 35 topics) and found the following output:

DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)]
LDA 1 : [(29, 0.80571428571428572)]
DOC 2 : [(1522, 1), (5364, 1), (6202, 1), (6661, 1), (6983, 1)]
LDA 2 : [(29, 0.83809523809523812)]
DOC 3 : [(3079, 1), (3395, 1), (4874, 1)]
LDA 3 : [(34, 0.75714285714285712)]
DOC 4 : [(1482, 1), (2806, 1), (3988, 1)]
LDA 4 : [(22,
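The sparse output is expected: a gensim LDA transform reports only the topics whose probability exceeds a threshold (the model's minimum_probability, by default around 0.01), and short documents tend to concentrate their mass on one topic. A minimal sketch of that filtering, with a 5-topic toy distribution standing in for one of the 35-topic rows:

```python
def significant_topics(dense_dist, minimum_probability=0.01):
    """Mimic how a gensim-style LDA transform reports a document's topics:
    keep only the (topic_id, probability) pairs above a threshold."""
    return [(t, p) for t, p in enumerate(dense_dist) if p > minimum_probability]

# Toy 5-topic distribution; with 35 topics and a 3-4 word document, most
# entries are similarly tiny and get filtered out the same way.
doc_topics = [0.002, 0.805, 0.002, 0.189, 0.002]
print(significant_topics(doc_topics))  # [(1, 0.805), (3, 0.189)]
```

To see the full dense distribution instead, the model can be created (or queried) with minimum_probability=0 so that no topic is filtered.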

How to load a pre-trained Word2vec MODEL File and reuse it?

Submitted by 百般思念 on 2020-01-02 03:00:11
Question: I want to use a pre-trained word2vec model, but I don't know how to load it in Python. The file is a MODEL file (703 MB). It can be downloaded here: http://devmount.github.io/GermanWordEmbeddings/ Answer 1: Just for loading:

import gensim

# Load pre-trained Word2Vec model.
model = gensim.models.Word2Vec.load("modelName.model")

Now you can train the model as usual. Also, if you want to be able to save it and retrain it multiple times, here is what you should do:

model.train(//insert proper parameters

Generator is not an iterator?

Submitted by 不羁岁月 on 2020-01-01 07:56:46
Question: I have a generator (a function that yields stuff), but when I try to pass it to gensim.Word2Vec I get the following error:

TypeError: You can't pass a generator as the sentences argument. Try an iterator.

Isn't a generator a kind of iterator? If not, how do I make an iterator from it? Looking at the library code, it seems to simply iterate over sentences like for x in enumerate(sentences), which works just fine with my generator. What is causing the error then? Answer 1: A generator is exhausted
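The distinction the answer hints at: a generator is indeed an iterator, but a one-shot one, and Word2Vec needs several passes over sentences (a vocabulary scan plus the training epochs). The usual fix is a restartable iterable whose __iter__ returns a fresh generator each time. A minimal sketch:

```python
class RestartableSentences:
    """An iterable (not a one-shot iterator): every iter() starts a fresh pass."""
    def __init__(self, corpus):
        self.corpus = corpus
    def __iter__(self):
        for line in self.corpus:  # in practice: reopen a file here
            yield line.split()

def one_shot():
    for line in ["the cat", "a dog"]:
        yield line.split()

gen = one_shot()
print(len(list(gen)), len(list(gen)))      # 2 0 -> the second pass sees nothing

sents = RestartableSentences(["the cat", "a dog"])
print(len(list(sents)), len(list(sents)))  # 2 2 -> every pass restarts
```

An instance of such a class can be passed as sentences, since each internal pass gets its own generator.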

An Overview of Text Processing Methods

Submitted by 喜夏-厌秋 on 2020-01-01 02:52:34
https://www.cnblogs.com/arachis/p/text_dig.html

Note: this post is practice-oriented; reference links are given for the theory wherever possible.

Abstract:
  1. Word segmentation
  2. Keyword extraction
  3. Word representations
  4. Topic models (LDA/TWE)
  5. A brief introduction to several common NLP tools
  6. Text mining (text classification, text tagging)
    6.1 Data preprocessing
    6.2 Feature engineering for text
    6.3 Models for text
  7. NLP tasks (part-of-speech tagging, syntactic parsing)
  8. NLP applications (information retrieval, sentiment analysis, text summarization, OCR, speech recognition, image captioning, question answering, knowledge graphs)
    8.1 Knowledge extraction

Content:
  1. Word segmentation
  Segmentation is the first step in text processing. Words are the most basic unit of language, and both the bag-of-words and word-vector representations used later in text mining depend on segmentation, so a good segmentation tool is very important. Here the basic segmentation workflow is illustrated with Python's jieba segmenter. Before the walkthrough, a word on how jieba works as a whole: Figure 1 shows the four possible stages of jieba's cut function, and Figure 2 shows how the maximum-probability path is computed from the DAG; for a detailed code walkthrough, see the jieba cut source code. With all that said, let's return to practice and look at jieba's segmentation API:

  # encoding=utf-8
  import jieba

  seg_list = jieba.cut(
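The keyword-extraction step (item 2 above) is worth a concrete sketch. jieba's extract_tags ranks jieba-segmented tokens by TF-IDF; the same idea can be shown on whitespace-tokenized English, with the toy documents below invented for illustration:

```python
import math
from collections import Counter

# Minimal TF-IDF keyword scoring over whitespace-tokenized documents: a
# stand-in for jieba.analyse.extract_tags, which applies the same idea to
# jieba-segmented Chinese text.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

def tfidf_keywords(doc_index, docs, top_k=2):
    tokenized = [d.split() for d in docs]
    tf = Counter(tokenized[doc_index])
    n_docs = len(docs)
    scores = {
        # term frequency in this doc * inverse document frequency in the corpus
        w: (tf[w] / len(tokenized[doc_index]))
           * math.log(n_docs / sum(1 for d in tokenized if w in d))
        for w in tf
    }
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

print(tfidf_keywords(0, docs))
```

Words that appear in every document (like "the") get an IDF near zero and drop out of the ranking, which is exactly why raw frequency alone is a poor keyword signal.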