gensim

Topic distribution: how do we see which documents belong to which topics after running LDA in Python?

Submitted by 不想你离开。 on 2019-11-27 11:23:15
I am able to run the LDA code from gensim and got the top 10 topics with their respective keywords. Now I would like to go a step further and check how well the LDA algorithm works by seeing which documents it clusters into each topic. Is this possible in gensim LDA? Basically I would like to do something like this, but in Python and using gensim: LDA with topicmodels, how can I see which topics different documents belong to? Using the topic probabilities, you can try to set some threshold and use it as a clustering baseline, but I am sure there are better ways to do clustering than this.
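One way to inspect per-document assignments is a minimal sketch along these lines, assuming a trained LdaModel and the bag-of-words corpus it was trained on (the toy corpus below is illustrative):

from gensim import corpora, models

# Illustrative toy corpus; substitute your own tokenized documents.
texts = [["human", "machine", "interface"],
         ["graph", "trees", "minors"],
         ["user", "computer", "system"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

# For each document, fetch its topic distribution and take the most
# probable topic as a hard cluster assignment.
for i, bow in enumerate(corpus):
    topics = lda.get_document_topics(bow, minimum_probability=0.0)
    best_topic, best_prob = max(topics, key=lambda t: t[1])
    print("doc %d -> topic %d (p=%.2f)" % (i, best_topic, best_prob))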

Word2Vec

Submitted by 浪尽此生 on 2019-11-27 07:52:14
Copyright notice: this is the blogger's original article, released under the CC 4.0 BY-SA license; please include the original source link and this notice when reposting. Original link: https://blog.csdn.net/qq_28840013/article/details/89681499

We will not cover the theory behind word2vec here (truthfully, I don't understand it thoroughly yet and will write it up once I do; before reading this, it helps to skim the derivation first), only its parameters, inputs, and outputs. There are also TensorFlow implementations of word2vec online; that is a topic for another time.

1. What Word2vec does: it expresses similarity and analogy relationships between different words.
2. Installation: pip install --upgrade gensim (the gensim toolkit bundles the Word2vec method).
3. Input format:

from gensim.models import word2vec
# sentences = [["a","b"], ["b","c"], ...]
sentences = word2vec.Text8Corpus("test.txt")  # "test.txt" is the corpus file name
# sentences is the training corpus, loaded here from a file; the training
# set is English text or pre-tokenized Chinese text.

sentences is the training material and can be loaded in two formats:
1. Plain text: tokenize each article and remove stop words, join the tokens with spaces, and store the result in a txt file, one article per line. After processing text in this format
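As a complement, a minimal sketch of training and querying a gensim Word2Vec model on an in-memory corpus; the sentences and query word are illustrative, and keyword names differ between gensim versions (e.g. vector_size in gensim 4.x was size in 3.x), so defaults are used here:

from gensim.models import Word2Vec

# Illustrative pre-tokenized corpus: one token list per sentence.
sentences = [["human", "machine", "interface"],
             ["user", "computer", "system"],
             ["machine", "learning", "system"]]

# min_count=1 keeps every word; real corpora usually use a higher cutoff.
model = Word2Vec(sentences, min_count=1)
print(model.wv.most_similar("machine", topn=3))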

Why are multiple model files created in gensim word2vec?

Submitted by 北城以北 on 2019-11-27 06:43:59
Question: When I try to create a word2vec model (skip-gram with negative sampling), I receive 3 files as output, as follows: word2vec (file), word2vec.syn1neg.npy (NPY file), word2vec.wv.syn0.npy (NPY file). I am just worried why this happens, as in my previous word2vec test examples I only received one model file (no npy files). Please help me.

Answer 1: Models with larger internal vector arrays can't be saved via Python pickle to a single file, so beyond a certain threshold, the gensim save() method will store the oversized arrays in separate files, using numpy's raw .npy format; keep all of the files together, since a later load() expects to find them alongside the main file.
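A minimal sketch of the save/load round trip, with illustrative file names; gensim's load() automatically picks up any companion .npy files written by save():

from gensim.models import Word2Vec

model = Word2Vec([["a", "b"], ["b", "c"]], min_count=1, sg=1, negative=5)
model.save("word2vec")                # may also write word2vec.*.npy for large arrays
restored = Word2Vec.load("word2vec")  # finds the companion .npy files itself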

How to print the LDA topic models from gensim in Python?

Submitted by 我们两清 on 2019-11-27 05:17:56
Question: Using gensim I was able to extract topics from a set of documents with LSA, but how do I access the topics generated from the LDA models? When calling lda.print_topics(10), the code gave the following error because print_topics() returned None:

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

The code: from gensim import corpora, models, similarities
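In old gensim versions print_topics() only logged the topics and returned None, which triggers exactly this TypeError. A minimal sketch using show_topics(), which returns the topics instead (the toy corpus is illustrative):

from gensim import corpora, models

texts = [["human", "interface", "computer"],
         ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

# formatted=True yields (topic_id, "weight*word + ...") pairs.
for topic_id, topic in lda.show_topics(num_topics=2, formatted=True):
    print(topic_id, topic)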

Python Exception Handling

Submitted by 柔情痞子 on 2019-11-27 04:44:46
AttributeError: 'dict' object has no attribute 'iteritems' — in Python 3.5, iteritems was renamed to items.

Fixing garbled characters when writing to a file with write(): to prepare for the coming waves of big data and artificial intelligence, I have been learning Python recently, which involves many conversions between Chinese character encodings. Today, while using the write() method, I found that the Chinese characters I wrote came out garbled. After much head-scratching and searching online, it turned out that passing encoding="utf-8" to open() fixes it:

fos = open("index.text", "w", encoding="utf-8")
fos.write("我今年十八岁")
fos.close()

Be sure to open the file in write mode ("w") when writing, otherwise an error is raised.

Fixing UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte in Python 3. There are two approaches. You can change the character-set argument, since this problem most often comes from mixing up GBK and utf8. The exception itself is raised because the second argument of decode(), errors, is set to strict mode — which is the default.
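A minimal sketch of both workarounds, using an illustrative GBK-encoded byte string:

# Bytes that are valid GBK but not valid UTF-8.
raw = "我今年十八岁".encode("gbk")

# 1) Decode with the codec the data was actually written in.
print(raw.decode("gbk"))

# 2) Relax decode()'s errors argument instead of the default "strict".
print(raw.decode("utf-8", errors="ignore"))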

Interpreting the sum of TF-IDF scores of words across documents

Submitted by 我怕爱的太早我们不能终老 on 2019-11-27 02:36:55
Question: First let's extract the TF-IDF scores per term per document:

from gensim import corpora, models, similarities

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in
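To make the idea concrete, a minimal sketch (with a shortened corpus) that builds the TF-IDF model and sums each term's score across all documents:

from collections import defaultdict
from gensim import corpora, models

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time"]
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
tfidf = models.TfidfModel(corpus)

# Accumulate each term's TF-IDF weight over every document.
totals = defaultdict(float)
for doc in tfidf[corpus]:
    for term_id, score in doc:
        totals[dictionary[term_id]] += score
print(sorted(totals.items(), key=lambda kv: -kv[1]))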

How to use similarities.Similarity in gensim (documented almost nowhere online)

Submitted by 匆匆过客 on 2019-11-27 00:54:46
index = similarities.MatrixSimilarity(lsi[corpus])

The official documentation translates roughly as follows. Warning: the similarities.MatrixSimilarity class is only suitable when all of the vectors fit in memory. For example, a corpus of one million documents would need about 2 GB of RAM in a 256-dimensional LSI space when using this class. Without enough memory, you can use the similarities.Similarity class instead. It operates in fixed-size memory, because it splits the index into multiple files (called shards) stored on disk. Internally it uses the similarities.MatrixSimilarity and similarities.SparseMatrixSimilarity classes, so it is still fast, even though it looks more complicated.

Now that is exactly my situation: a large corpus, and running the MatrixSimilarity class raises MemoryError. But where is the usage of similarities.Similarity documented? I searched the whole web without finding an answer, and most annoyingly the official site doesn't explain this usage either — without the parameters spelled out, how am I supposed to use it? Gratefully, someone posted an answer at https://stackoverflow.com/questions/36578341/how-to-use-similarities-similarity-in-gensim
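Based on the gensim API, a minimal sketch of similarities.Similarity: it takes an output prefix for the on-disk shards plus the number of vector-space features (the shard path and toy corpus are illustrative):

from gensim import corpora, models, similarities

texts = [["human", "interface"], ["graph", "trees"], ["user", "system"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

# Index shards are written to files starting with the given prefix.
index = similarities.Similarity("/tmp/lsi_index",
                                lsi[corpus],
                                num_features=lsi.num_topics)
query = lsi[dictionary.doc2bow(["human", "system"])]
print(list(index[query]))  # cosine similarity against every indexed document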

How to calculate sentence similarity using gensim's word2vec model in Python

Submitted by 半世苍凉 on 2019-11-26 23:26:50
According to the Gensim Word2Vec documentation, I can use the word2vec model in the gensim package to calculate the similarity between two words, e.g.:

trained_model.similarity('woman', 'man')
0.73723527

However, the word2vec model fails to predict sentence similarity. I found the LSI model with sentence similarity in gensim, but it doesn't seem like it can be combined with the word2vec model. The sentences in my corpus are not very long (shorter than 10 words each). So, are there any simple ways to achieve the goal?

This is actually a pretty challenging problem that you are asking. Computing
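A common simple baseline (not taken from this thread) is to average the word vectors of each sentence and compare the averages with cosine similarity; a minimal sketch with an illustrative toy corpus:

import numpy as np
from gensim.models import Word2Vec

sentences = [["woman", "reads", "a", "book"],
             ["man", "reads", "a", "paper"],
             ["dog", "chases", "cat"]]
model = Word2Vec(sentences, min_count=1)

def sentence_vector(words):
    # Average the vectors of the in-vocabulary words.
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(sentence_vector(sentences[0]), sentence_vector(sentences[1])))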

Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Submitted by 一世执手 on 2019-11-26 23:22:15
Question: I am starting on a Python task and am facing a problem while using gensim. I am trying to load files from my disk and process them (split them and lowercase them). The code I have is below:

dictionary_arr = []
for file_path in glob.glob(os.path.join(path, '*.txt')):
    with open(file_path, "r") as myfile:
        text = myfile.read()
        for words in text.lower().split():
            dictionary_arr.append(words)
dictionary = corpora.Dictionary(dictionary_arr)

The list (dictionary_arr) contains the list of all words
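The error arises because corpora.Dictionary expects one token list per document, not a single flat list of word strings; a minimal sketch of the likely fix (not taken from the thread, with an illustrative directory path):

import glob
import os
from gensim import corpora

path = "docs"  # illustrative directory of .txt files
texts = []
for file_path in glob.glob(os.path.join(path, "*.txt")):
    with open(file_path, "r") as myfile:
        # One token list per document.
        texts.append(myfile.read().lower().split())

dictionary = corpora.Dictionary(texts)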

How to create a word cloud from a corpus in Python?

Submitted by 做~自己de王妃 on 2019-11-26 22:33:21
Question: In Creating a subset of words from a corpus in R, the answerer can easily convert a term-document matrix into a word cloud. Is there a similar function in the Python libraries that takes either a raw word text file, an NLTK corpus, or a Gensim Mmcorpus and turns it into a word cloud? The result will look somewhat like this:

Answer 1: Here's a blog post which does just that: http://peekaboo-vision.blogspot.com/2012/11/a-wordcloud-in-python.html The whole code is here: https://github.com/amueller/word_cloud
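A minimal sketch using the word_cloud package linked above (pip install wordcloud); the input text is illustrative:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "human machine interface computer user system graph trees system system"
wc = WordCloud(width=400, height=200, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()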