gensim

How to load sentences into Python gensim?

Submitted by 孤者浪人 on 2019-12-03 02:20:46
I am trying to use the word2vec module from the gensim natural language processing library in Python. The docs say to initialize the model:

    from gensim.models import word2vec
    model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

What format does gensim expect for the input sentences? I have raw text:

    "the quick brown fox jumps over the lazy dogs"
    "Then a cop quizzed Mick Jagger's ex-wives briefly."

etc. What additional processing do I need to do before passing it to word2vec? UPDATE: Here is what I have tried. When it loads the sentences, I get nothing:

    >>> sentences = ['the quick brown fox

How to calculate the sentence similarity using word2vec model of gensim with python

Submitted anonymously (unverified) on 2019-12-03 01:54:01
Question: According to the Gensim Word2Vec documentation, I can use the word2vec model in the gensim package to calculate the similarity between two words, e.g.:

    trained_model.similarity('woman', 'man')
    0.73723527

However, the word2vec model fails to predict sentence similarity. I found the LSI model with sentence similarity in gensim, but it does not seem like it can be combined with the word2vec model. The corpus I have consists of short sentences (fewer than 10 words each). So, are there any simple ways to achieve the goal?

Answer 1: This is

Gensim Doc2Vec Exception AttributeError: 'str' object has no attribute 'words'

Submitted anonymously (unverified) on 2019-12-03 01:45:01
Question: I am learning the Doc2Vec model from the gensim library and using it as follows:

    class MyTaggedDocument(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            for fname in os.listdir(self.dirname):
                with open(os.path.join(self.dirname, fname), encoding='utf-8') as fin:
                    print(fname)
                    for item_no, sentence in enumerate(fin):
                        yield LabeledSentence(
                            [w for w in sentence.lower().split() if w in stopwords.words('english')],
                            [fname.split('.')[0].strip() + '_%s' % item_no])

    sentences = MyTaggedDocument(dirname)
    model = Doc2Vec

gensim.LDAMulticore throwing an exception

Submitted anonymously (unverified) on 2019-12-03 01:41:02
Question: I am running LDAMulticore from the Python gensim library, and the script cannot seem to create more than one thread. Here is the error:

    Traceback (most recent call last):
      File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib64/python2.7/multiprocessing/pool.py", line 97, in worker
        initializer(*initargs)
      File "/usr/lib64/python2.7/site-packages/gensim/models/ldamulticore.py",

Document topical distribution in Gensim LDA

Submitted anonymously (unverified) on 2019-12-03 01:09:02
Question: I've derived an LDA topic model using a toy corpus as follows:

    documents = ['Human machine interface for lab abc computer applications',
                 'A survey of user opinion of computer system response time',
                 'The EPS user interface management system',
                 'System and human system engineering testing of EPS',
                 'Relation of user perceived response time to error measurement',
                 'The generation of random binary unordered trees',
                 'The intersection graph of paths in trees',
                 'Graph minors IV Widths of trees and well quasi ordering',
                 'Graph minors A survey']
    texts =

Gensim Study Notes 2: Topics and Transformations

Submitted anonymously (unverified) on 2019-12-03 00:43:02
    from pprint import pprint
    import warnings
    warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
    from gensim import corpora

    stopWordsList = set('for a of the and to in'.split())
    with open('./Data/mycorpus.txt', encoding='utf-8') as f:
        texts = [[word for word in line.lower().split() if word not in stopWordsList]
                 for line in f]
    dictionary = corpora.Dictionary.load('./Data/sampleDict.dict')
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    pprint(corpus)

    [[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1

How to use gensim, with examples

Submitted anonymously (unverified) on 2019-12-03 00:26:01
gensim is a Python natural language processing library that can convert documents into vector form according to models such as TF-IDF, LDA, and LSI, for further processing. In addition, gensim implements word2vec, which can turn words into word vectors; for background on word vectors, see my earlier article. My notes on using gensim are based on the official material and follow the official tutorial; if your English is good, or you feel my write-up is incomplete, go read the official docs.

1. corpora and dictionary

1.1 Basic concepts and usage

corpora is a basic concept in gensim: it is the representation of a document collection, and the basis for all further processing. Essentially, a corpus is a format, or convention; in effect it is a two-dimensional matrix. For example, suppose we have a document collection containing two documents:

    hurry up
    rise up

These two documents contain three distinct words in total: hurry, rise, up. If we map these three words to numbers, say hurry, rise, up correspond to 1, 2, 3 respectively, then one representation of the document collection above is

    1,0,1
    0,1,1

So how do we convert documents in string form into this representation? This is where the dictionary comes in. After splitting the documents into words, use

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text)

Gensim Introductory Tutorial

Submitted anonymously (unverified) on 2019-12-03 00:22:01
http://www.shuang0420.com/2016/05/18/Gensim-%E7%94%A8Python%E5%81%9A%E4%B8%BB%E9%A2%98%E6%A8%A1%E5%9E%8B/

About gensim: gensim is a free Python library that can efficiently and automatically extract semantic topics from documents. Algorithms in gensim include LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), and RP (Random Projections). By examining statistical co-occurrence patterns of words across a corpus of training documents, these algorithms can uncover the semantic structure of the documents. They are unsupervised, and can process raw, unstructured text ("plain text").

gensim features:
- Memory independence: the entire training corpus never needs to reside in RAM at any one time.
- Efficient implementations of many popular vector-space algorithms, including tf-idf, distributed LSA, distributed LDA, and RP; new algorithms are easy to add.
- I/O wrappers and converters for popular data formats.
- Similarity queries over documents in their semantic representation.

gensim was created because of the lack of a simple (Java is complex) scalable software framework for topic modeling.

gensim design principles:
- A simple interface with a low learning curve, convenient for prototyping.
- Memory independence with respect to the size of the input corpus: streaming-based algorithms that process one document at a time.

How to interpret cluster results after using Doc2Vec?

Submitted by 一曲冷凌霜 on 2019-12-02 21:13:45
Question: I am using doc2vec to convert the top 100 tweets of my followers into vector representations (say v1...v100). After that I am using the vector representations to do K-Means clustering:

    model = Doc2Vec(documents=t, size=100, alpha=.035, window=10, workers=4, min_count=2)

I can see that cluster 0 is dominated by some values (say v10, v12, v23, ...). My question is: what do these v10, v12, etc. represent? Can I deduce that these specific columns cluster specific keywords of the documents?

Answer 1:

Can we use a self-made corpus for training LDA using gensim?

Submitted by 别等时光非礼了梦想. on 2019-12-02 21:06:01
I have to apply LDA (Latent Dirichlet Allocation) to get the possible topics from a database of 20,000 documents that I collected. How can I use these documents, rather than another available corpus like the Brown Corpus or English Wikipedia, as the training corpus? You can refer to this page. After going through the documentation of the gensim package, I found that there are in total four ways of transforming a text repository into a corpus. The four corpus formats are:

- Market Matrix (.mm)
- SVMlight (.svmlight)
- Blei's LDA-C format (.lda-c)
- GibbsLDA++ "Low" format (.low)

In this problem, as mentioned above