LDA

PCA and LDA Dimensionality Reduction Test

Submitted by 谁说胖子不能爱 on 2019-11-27 16:12:43
Test overview: The purpose of this experiment is to test the dimensionality-reduction effect of LDA (Linear Discriminant Analysis), mainly in terms of training time, with PCA (Principal Component Analysis) included as a comparison. The program is fairly simple: both the dimensionality-reduction and the training algorithms are calls into Python's sklearn library, and all of the code is given in the program. Each run uses the same dataset and tests it in three different ways: training directly, training after PCA reduction, and training after LDA reduction.

File description: code: the test-program folder, containing LDA_test.py; dataset: the test-dataset folder; output: the folder of test-result screenshots.

Test environment: OS: Windows 10 64-bit; CPU: AMD Ryzen 5 2600X, 6 cores, 3.60 GHz; RAM: 16 GB; IDE/editor: PyCharm; Python version: 3.6.

LDA_test.py code:

import numpy as np
from pandas import read_csv
import time
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from
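The excerpt cuts off at the imports. Below is a minimal sketch of the three-way comparison the post describes (training an SVC directly, after PCA, and after LDA, timing each run); the CSV path, column layout, and split are assumptions for illustration, not from the original program.

import time
from pandas import read_csv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical dataset: features in every column but the last, label in the last.
data = read_csv("dataset/data.csv").values
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def fit_and_time(X_tr, X_te, label):
    clf = SVC()
    start = time.time()
    clf.fit(X_tr, y_train)
    print("%s: train time %.3fs, accuracy %.3f"
          % (label, time.time() - start, clf.score(X_te, y_test)))

fit_and_time(X_train, X_test, "direct")                 # 1) no reduction

pca = PCA(n_components=2).fit(X_train)                  # 2) unsupervised projection
fit_and_time(pca.transform(X_train), pca.transform(X_test), "PCA")

# 3) supervised projection; LDA can keep at most n_classes - 1 components
lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)
fit_and_time(lda.transform(X_train), lda.transform(X_test), "LDA")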

6. Dimensionality Reduction

Submitted by [亡魂溺海] on 2019-11-27 15:49:53
Once feature selection is done, the model can be trained directly, but an overly large feature matrix makes computation expensive and training slow, so reducing the dimensionality of the feature matrix is also essential. Besides the L1-penalty-based models mentioned above, common dimensionality-reduction methods include Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA); LDA is itself also a classification model. PCA and LDA have a lot in common: in essence, both map the original samples into a lower-dimensional sample space, but their mapping objectives differ. PCA maps the samples so that they have the greatest scatter (variance) after projection, while LDA maps them so that they have the best classification performance. PCA is therefore an unsupervised dimensionality-reduction method, while LDA is a supervised one. Source: https://www.cnblogs.com/pacino12134/p/11369036.html
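That supervised/unsupervised distinction is visible directly in the sklearn API (a small illustration on synthetic data, not from the original post): PCA is fit on the features alone, while LinearDiscriminantAnalysis also requires the labels.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.rand(100, 10)            # 100 samples, 10 features (synthetic)
y = np.random.randint(0, 3, 100)       # 3 class labels (synthetic)

X_pca = PCA(n_components=2).fit_transform(X)      # unsupervised: y is never seen
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised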

LDA model generates different topics every time I train on the same corpus

Submitted by 假装没事ソ on 2019-11-27 12:54:09
I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics. Why do the same LDA parameters and corpus generate different topics every time? And how do I stabilize the topic generation? I'm using this corpus ( http://pastebin.com/WptkKVF0 ) and this list of stopwords ( http://pastebin.com/LL7dqLcj ), and here's my code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import
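The usual fix (not shown in the excerpt) is to pin gensim's random number generator: in current gensim versions LdaModel takes a random_state argument, and in older versions seeding numpy before training served the same purpose. A minimal sketch on a toy corpus:

from gensim import corpora
from gensim.models import LdaModel

texts = [["human", "machine", "interface"], ["graph", "trees", "minors"]]  # toy corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# With a fixed random_state, repeated runs produce identical topics.
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=42)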

Topic distribution: How do we see which document belongs to which topic after doing LDA in Python

Submitted by 不想你离开。 on 2019-11-27 11:23:15
I am able to run the LDA code from gensim and got the top 10 topics with their respective keywords. Now I would like to go a step further and see how accurate the LDA algorithm is by checking which documents it clusters into each topic. Is this possible in gensim's LDA? Basically, I would like to do something like this, but in Python and using gensim: LDA with topicmodels, how can I see which topics different documents belong to? Using the topic probabilities, you can try to set some threshold and use it as a clustering baseline, but I am sure there are better ways to do clustering than this
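A sketch of one way to do this in gensim (toy corpus; the calls are gensim's documented API, not the poster's code): get_document_topics returns each document's topic distribution, and taking the maximum gives a hard assignment.

from gensim import corpora
from gensim.models import LdaModel

texts = [["human", "machine", "interface"], ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

for doc_id, bow in enumerate(corpus):
    dist = lda.get_document_topics(bow)              # [(topic_id, probability), ...]
    topic, prob = max(dist, key=lambda tp: tp[1])    # most likely topic
    print("doc %d -> topic %d (p=%.2f)" % (doc_id, topic, prob))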

How does removeSparseTerms in R work?

Submitted by 耗尽温柔 on 2019-11-27 10:09:55
Question: I am using the removeSparseTerms method in R, and it requires a threshold value as input. I also read that the higher the value, the more terms are retained in the returned matrix. How does this method work, and what is the logic behind it? I understand the concept of sparseness, but does this threshold indicate in how many documents a term should be present, or some other ratio, etc.?

Answer 1: In the sense of the sparse argument to removeSparseTerms(), sparsity refers to the
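To make the threshold concrete in this digest's other language (a deliberate swap from R to Python): removeSparseTerms(dtm, sparse = s) drops a term that is absent from more than a fraction s of documents, which corresponds to min_df = 1 - s in sklearn's CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]
s = 0.5                                  # removeSparseTerms-style threshold
vec = CountVectorizer(min_df=1.0 - s)    # keep terms present in at least 50% of docs
dtm = vec.fit_transform(docs)
print(vec.get_feature_names_out())       # ['cat' 'sat' 'the']; 'dog'/'ran' dropped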

Predicting LDA topics for new data

Submitted by 蹲街弑〆低调 on 2019-11-27 09:53:51
Question: It looks like this question may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the previous ambiguity of the question(s) asked, as indicated by comments. I apologize if I am breaking protocol by asking a similar question again; I just assumed that those questions would not be seeing any new answers. Anyway, I am new to Latent Dirichlet Allocation and am exploring its use as a means of dimension reduction for textual data.
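The excerpt doesn't say which LDA implementation the poster settled on; purely as an illustration, this is how gensim infers a topic mixture (a dense, low-dimensional representation) for a document that was not in the training corpus:

from gensim import corpora
from gensim.models import LdaModel

texts = [["human", "machine", "interface"], ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

new_bow = dictionary.doc2bow(["machine", "graph"])         # unseen document
print(lda.get_document_topics(new_bow, minimum_probability=0.0))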

LDA with topicmodels, how can I see which topics different documents belong to?

Submitted by 爱⌒轻易说出口 on 2019-11-27 07:00:41
I am using LDA from the topicmodels package, and I have run it on about 30,000 documents, extracted 30 topics, and got the top 10 words for each topic; they look very good. But I would like to see which documents belong to which topic with the highest probability. How can I do that?

myCorpus <- Corpus(VectorSource(userbios$bio))
docs <- userbios$twitter_id
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
myStopwords <-
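Whatever the implementation, once you have the posterior documents-by-topics probability matrix, the highest-probability assignment is just the row-wise argmax; a tiny language-swapped sketch (Python rather than the question's R, with a made-up matrix):

import numpy as np

doc_topic = np.array([[0.1, 0.7, 0.2],    # toy posterior: 2 docs, 3 topics
                      [0.6, 0.3, 0.1]])
print(doc_topic.argmax(axis=1))           # [1 0]: most likely topic per document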

Remove empty documents from DocumentTermMatrix in R topicmodels?

Submitted by 我怕爱的太早我们不能终老 on 2019-11-27 06:37:34
I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:

corpus <- Corpus(VectorSource(vec), readerControl=list(language="en"))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
...snip removing several custom lists of stopwords...
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus, control=list
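The underlying issue (not shown in the truncated excerpt) is that LDA cannot fit a document whose row in the document-term matrix sums to zero after preprocessing. A language-swapped Python sketch of the standard fix, dropping all-zero rows from a sparse matrix:

import numpy as np
from scipy.sparse import csr_matrix

dtm = csr_matrix(np.array([[1, 0, 2],
                           [0, 0, 0],     # empty document after preprocessing
                           [0, 3, 0]]))
keep = np.asarray(dtm.sum(axis=1)).ravel() > 0
print(dtm[keep].shape)                    # (2, 3): the all-zero row is gone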

How to print the LDA topic models from gensim? Python

Submitted by 我们两清 on 2019-11-27 05:17:56
Question: Using gensim I was able to extract topics from a set of documents with LSA, but how do I access the topics generated from the LDA models? When printing lda.print_topics(10), the code gave the following error, because print_topics() returns None:

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

The code:

from gensim import corpora, models, similarities
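For context (not in the excerpt): in old gensim versions print_topics() wrote to the logger and returned None, which is what produces the TypeError above; in current gensim both print_topics() and show_topics() return a list you can iterate. A minimal sketch on a toy corpus:

from gensim import corpora
from gensim.models import LdaModel

texts = [["human", "machine", "interface"], ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

for topic_id, words in lda.show_topics(num_topics=2, formatted=True):
    print(topic_id, words)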

Spark MLlib LDA, how to infer the topic distribution of a new, unseen document?

Submitted by 流过昼夜 on 2019-11-26 22:31:44
I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to use the model to find the topic distribution in a new, unseen document.

Answer (Jason Lenderman): As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you're going to need to do is convert your model to a LocalLDAModel using the toLocal method, and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this:

newDocuments:
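The answer above targets Spark's RDD-based MLlib API in Scala. As a sketch only, in this digest's primary language: the newer DataFrame-based pyspark.ml API exposes the same inference directly, since the fitted model's transform() adds a topicDistribution column for any documents, including unseen ones.

from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-infer").getOrCreate()
train = spark.createDataFrame(
    [(0, ["human", "machine", "interface"]), (1, ["graph", "trees", "minors"])],
    ["id", "words"])
cv = CountVectorizer(inputCol="words", outputCol="features").fit(train)

model = LDA(k=2, seed=1).fit(cv.transform(train))      # fit on the training docs

# Run a new, out-of-training document through the same vectorizer and model.
new_docs = spark.createDataFrame([(2, ["machine", "graph"])], ["id", "words"])
model.transform(cv.transform(new_docs)) \
     .select("id", "topicDistribution").show(truncate=False)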