http://www.shuang0420.com/2016/05/18/Gensim-%E7%94%A8Python%E5%81%9A%E4%B8%BB%E9%A2%98%E6%A8%A1%E5%9E%8B/

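Introduction to gensim

gensim is a free Python library for automatically and efficiently extracting semantic topics from documents. Its algorithms include LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), and RP (Random Projections). They discover the semantic structure of documents by examining statistical co-occurrence patterns of words within a corpus of training documents. These algorithms are unsupervised and work on raw, unstructured plain text.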

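Features of gensim

Memory independence: the training corpus never has to reside fully in RAM at any one time.
Efficient implementations of several popular vector space algorithms, including tf-idf, distributed LSA, distributed LDA, and RP; new algorithms are easy to add.
I/O wrappers and converters for popular data formats.
Similarity queries over documents in their semantic representation.
gensim was created because of the lack of a simple, scalable software framework for topic modeling (the Java options are complex).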

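Design principles of gensim

Simple interfaces and a gentle learning curve, which makes prototyping convenient.
Memory independence with respect to the size of the input corpus: all algorithms are stream-based and access one document at a time.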
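
The streaming principle can be illustrated with a small sketch. The class name `StreamedCorpus`, the whitespace tokenizer, and the `token2id` mapping are hypothetical and only stand in for what gensim's own dictionary and corpus classes provide. The point is that `__iter__` yields one sparse document at a time, so the whole corpus never has to sit in RAM.

```python
from collections import Counter
import io


class StreamedCorpus:
    """Yield one bag-of-words document per line of a text source (illustrative only)."""

    def __init__(self, open_source, token2id):
        # open_source: zero-argument callable returning a fresh iterable of lines,
        # so that the corpus can be iterated over more than once.
        self.open_source = open_source
        self.token2id = token2id

    def __iter__(self):
        for line in self.open_source():
            counts = Counter(
                self.token2id[tok] for tok in line.split() if tok in self.token2id
            )
            # Sparse representation: sorted (feature_id, count) pairs.
            yield sorted((fid, float(n)) for fid, n in counts.items())


# Hypothetical vocabulary and an in-memory "file" used only for this demo.
token2id = {"human": 0, "computer": 1, "interface": 2}
text = "human computer interface\ncomputer computer graph\n"
corpus = StreamedCorpus(lambda: io.StringIO(text), token2id)

for doc in corpus:
    print(doc)
```

In real use the callable would open a file on disk, for example `lambda: open(path, encoding="utf-8")`, and only one line would be held in memory at a time.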

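Core concepts of gensim

The whole gensim package revolves around three concepts: corpus, vector, and model.

Corpus
A collection of documents, used to automatically infer the structure of the documents, their topics, and so on. It is also called the training corpus.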

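Vector
In the vector space model (VSM), each document is represented by an array of features. For example, a single feature can be thought of as a question-answer pair:
[1]. How many times does the word "splonge" appear in the document? 0
[2]. How many sentences does the document contain? 2
[3]. How many fonts does the document use? 5
The questions can be represented by integer ids (such as 1, 2, 3), so the document above becomes (1, 0.0), (2, 2.0), (3, 5.0). If all the questions are known in advance, the ids can be left implicit and the document written as (0.0, 2.0, 5.0). This sequence of answers can be viewed as a vector (3-dimensional in this case). For practical purposes, only questions whose answer is a single real number are allowed.
The questions are the same for every document. Looking at two vectors (representing two documents), we hope to be able to draw conclusions such as: "if the numbers in the two vectors are similar, the original documents can be considered similar as well". Whether such a conclusion holds depends, of course, on how well the questions were chosen.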

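Sparse vector
Usually most answers are 0.0. To save space, they are omitted from the document representation, and we write only (2, 2.0), (3, 5.0) (note the missing (1, 0.0)). Since the full set of questions is known in advance, every feature missing from a sparse representation can be unambiguously taken to be 0.0.
gensim does not prescribe any particular corpus format: a corpus can be anything that yields sparse vectors when iterated over. For example, the list [[(2, 2.0), (3, 5.0)], [(0, -1.0), (3, -1.0)]] is a corpus of two documents, each with two non-zero pairs.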
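
As a rough illustration of the dense and sparse forms described above, the following sketch converts between them. The helper names `to_sparse` and `to_dense` are made up for this example and are not gensim functions. Feature ids start at 1 here only to match the question numbering in the text.

```python
def to_sparse(dense, first_id=1):
    """Drop zero answers and keep (feature_id, value) pairs."""
    return [(first_id + k, v) for k, v in enumerate(dense) if v != 0.0]


def to_dense(sparse, num_features, first_id=1):
    """Rebuild the full vector; every missing feature is taken to be 0.0."""
    dense = [0.0] * num_features
    for fid, value in sparse:
        dense[fid - first_id] = value
    return dense


doc_dense = [0.0, 2.0, 5.0]           # answers to questions 1, 2, 3
doc_sparse = to_sparse(doc_dense)     # the (1, 0.0) pair is omitted
print(doc_sparse)
print(to_dense(doc_sparse, num_features=3))
```

The round trip only works because the number of features is known in advance, which is exactly the assumption the sparse representation relies on.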

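Model
For our purposes, a model is a transformation from one document representation to another. Both the initial and the target representation are vectors; they differ only in what the questions and answers are. The transformation is learned automatically from the training corpus, without human supervision, in the hope that the final document representation is more compact and more useful, with similar documents having similar representations.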
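
To make the idea of "a model is a transformation learned from a corpus" concrete, here is a toy sketch. `ToyTfidf` is a hypothetical, simplified tf-idf (inverse document frequency as log2(N / df), followed by length normalization); it is not gensim's implementation, and it assumes each feature id appears at most once per document. It only mimics the pattern of training on a corpus and then applying the model with square brackets.

```python
import math
from collections import Counter


class ToyTfidf:
    """Learn document frequencies from a corpus, then reweight sparse vectors."""

    def __init__(self, corpus):
        corpus = list(corpus)
        num_docs = len(corpus)
        # Document frequency: in how many documents does each feature occur?
        doc_freq = Counter(fid for doc in corpus for fid, _ in doc)
        self.idf = {fid: math.log2(num_docs / df) for fid, df in doc_freq.items()}

    def __getitem__(self, doc):
        # Same questions (feature ids), new answers (weights).
        weighted = [(fid, tf * self.idf.get(fid, 0.0)) for fid, tf in doc]
        weighted = [(fid, w) for fid, w in weighted if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        return [(fid, w / norm) for fid, w in weighted] if norm else []


corpus = [
    [(0, 1.0), (1, 1.0)],
    [(1, 2.0), (2, 1.0)],
    [(0, 1.0), (1, 1.0), (2, 3.0)],
]
model = ToyTfidf(corpus)      # "training": gather corpus statistics
print(model[corpus[2]])       # "transformation": one vector in, one vector out
```

Feature 1 occurs in every document, so its weight drops to zero and it disappears from the output, which makes the transformed vector more compact than the input.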

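Introduction to LDA

LDA is a typical bag-of-words model: a document is treated as a bag of words, with no order or sequence among them. A document can contain multiple topics, and each word in the document is generated by one of those topics.

Concepts to understand:

One function: the gamma function
Two distributions: the beta distribution and the Dirichlet distribution
One model: LDA (document-topic, topic-word)
One sampling method: Gibbs sampling

Core formula: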

p(w|d) = Σ_t p(w|t) * p(t|d)

Here w is a word, d is a document, and t is a topic; the sum runs over all topics.
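
A small numeric check of the core formula may help. The two topics, the three-word vocabulary, and all probabilities below are invented for illustration; the code multiplies p(w|t) by p(t|d) and sums over the topics.

```python
# Hypothetical topic-word distributions p(w|t); each row sums to 1.
p_w_given_t = {
    "sports":   {"ball": 0.50, "team": 0.40, "vote": 0.10},
    "politics": {"ball": 0.05, "team": 0.15, "vote": 0.80},
}

# Hypothetical document-topic distribution p(t|d) for a single document.
p_t_given_d = {"sports": 0.7, "politics": 0.3}


def word_prob(word):
    """p(w|d): sum over topics of p(w|t) * p(t|d)."""
    return sum(p_w_given_t[t][word] * p_t_given_d[t] for t in p_t_given_d)


for word in ("ball", "team", "vote"):
    print(word, round(word_prob(word), 3))
```

Because both inputs are proper probability distributions, the resulting p(w|d) values also sum to 1 over the vocabulary.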

Document generation process

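Draw the topic distribution θ_i of document i from a Dirichlet distribution with parameter α.
Draw the topic z_{i,j} of the j-th word of document i from the multinomial distribution θ_i.
Draw the word distribution φ_{z_{i,j}} of topic z_{i,j} from a Dirichlet distribution with parameter β.
Draw the word w_{i,j} from the multinomial distribution φ_{z_{i,j}}.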
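
The generative story of LDA can be simulated in a few lines: draw a topic distribution for the document from a Dirichlet prior, then for each word position draw a topic from that distribution and a word from the chosen topic's word distribution. The sketch below uses only the standard library; the vocabulary, the prior values, and all function names are assumptions made for this example, and it does no inference.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(num_words, alpha, phi, vocab, rng):
    theta = sample_dirichlet(alpha, rng)      # topic distribution of this document
    topics, words = [], []
    for _ in range(num_words):
        z = sample_index(theta, rng)          # topic of this word position
        w = sample_index(phi[z], rng)         # word drawn from that topic
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words


rng = random.Random(0)
vocab = ["ball", "team", "goal", "vote", "law", "party"]
num_topics = 2
# One word distribution per topic, drawn from a symmetric Dirichlet prior.
phi = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
theta, topics, words = generate_document(8, [0.5] * num_topics, phi, vocab, rng)
print(theta)
print(list(zip(topics, words)))
```

Training an LDA model runs this story in reverse: given only the words, it estimates the hidden topic and word distributions.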

How to choose the number of topics

Minimize the similarity between topics
Perplexity
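
Both criteria can be written down compactly. The sketch below is one possible reading, not something the source specifies: topic similarity is measured as the average pairwise cosine similarity between topic-word distributions, and perplexity as the exponential of the negative mean log-probability of held-out words. The function names and the example numbers are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_topic_similarity(topic_word):
    """Average cosine similarity over all pairs of topics (lower is better)."""
    pairs = list(combinations(topic_word, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def perplexity(word_probs):
    """exp(-mean log p(w)) over held-out word probabilities (lower is better)."""
    log_likelihood = sum(math.log(p) for p in word_probs)
    return math.exp(-log_likelihood / len(word_probs))


# Two hypothetical models over a three-word vocabulary.
distinct_topics = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
overlapping_topics = [[0.5, 0.4, 0.1], [0.4, 0.5, 0.1]]
print(mean_topic_similarity(distinct_topics))
print(mean_topic_similarity(overlapping_topics))

# A model assigning probability 0.25 to every held-out word behaves like
# a uniform choice among 4 words.
print(perplexity([0.25] * 8))
```

In practice one would train models with several candidate topic counts, compute these scores for each, and pick the count where they stop improving.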

Python gensim implementation

# Install the required Python packages from a shell first:
#   pip install numpy
#   pip install scipy
#   pip install gensim
#   pip install jieba

import logging

import jieba
from gensim import corpora, models

# configuration
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# load data from file, one document per line
with open('newfile.txt', 'r', encoding='utf-8') as f:
    documents = f.readlines()

# tokenize
texts = [list(jieba.cut(document, cut_all=False)) for document in documents]

# build the word <-> id mapping (the dictionary)
dictionary = corpora.Dictionary(texts)

# keep only words that appear in at least 40 documents and in no more than 10% of all documents
dictionary.filter_extremes(no_below=40, no_above=0.1)

# save dictionary
dictionary.save('dict_v1.dict')

# convert the tokenized documents to bag-of-words vectors (the corpus)
corpus = [dictionary.doc2bow(text) for text in texts]

# initialize a model
tfidf = models.TfidfModel(corpus)

# use the model to transform vectors, apply a transformation to a whole corpus
corpus_tfidf = tfidf[corpus]

# extract 100 LDA topics using 500 iterations; the number of passes and the chunk size keep their defaults
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=100, iterations=500)

# save model to files
lda.save('mylda_v1.pkl')

# print topics composition, and their scores, for the first document.
# You will see that only few topics are represented; the others have a nil score.
for index, score in sorted(lda[tfidf[corpus[0]]], key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda.print_topic(index, 10)))

# print the most contributing words for 100 randomly selected topics
lda.print_topics(100)

# load model and dictionary
model = models.LdaModel.load('mylda_v1.pkl')
dictionary = corpora.Dictionary.load('dict_v1.dict')

# predict unseen data
query = "未收到奖励"
query_bow = dictionary.doc2bow(list(jieba.cut(query, cut_all=False)))
for index, score in sorted(model[query_bow], key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic: {}".format(score, model.print_topic(index, 20)))

# if you want to predict many lines of data in a file, do the following
with open('newfile.txt', 'r', encoding='utf-8') as f:
    documents = f.readlines()
texts = [list(jieba.cut(document, cut_all=False)) for document in documents]
corpus = [dictionary.doc2bow(text) for text in texts]

# only print the topic with the highest score
for c in corpus:
    topics = model[c]
    if topics:
        index, score = max(topics, key=lambda tup: tup[1])
        print("Score: {}\t Topic: {}".format(score, model.print_topic(index, 20)))

Tips

If you run into encoding problems on Python 2, you can try the following code. On Python 3, strings are Unicode by default and sys.setdefaultencoding no longer exists, so the sys part does not apply there.

# add this at the very beginning of your Python file
# -*- coding: utf-8 -*-

# also do the following (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# the following call may lead to encoding problems when there are Chinese characters
model.show_topics(-1, 5)

# use this instead
model.print_topics(-1, 5)

For step-by-step output, see the references below.

References:
https://radimrehurek.com/gensim/tut2.html
http://blog.csdn.net/questionfish/article/details/46725475
https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation

Source: Gensim introductory tutorial
