lda

Machine Learning: Notes on the LDA Topic Model

左心房为你撑大大i Submitted on 2019-12-01 02:05:22
Common application areas of LDA: information extraction and search (semantic analysis); document classification/clustering, article summarization, and community mining; content-based image clustering and object recognition (and other computer vision applications); bioinformatics data.

The naive Bayes model can handle many text classification problems, but it cannot resolve polysemy (one word, many meanings) or synonymy (many words, one meaning) in a corpus -- it is closer to lexical analysis than to semantic analysis. If word vectors are used directly as document features, polysemy and synonymy make document-similarity computations inaccurate. By introducing "topics", the LDA model alleviates these problems to some extent: a single word may be mapped to several topics (polysemy), and several words may each map to the same topic with high probability (synonymy).

The main topics LDA involves:
1) conjugate prior distributions
2) the Dirichlet distribution
3) the LDA model, with Gibbs sampling used to learn its parameters

Conjugate prior distributions. Since x is the given sample, P(x) is sometimes called the "evidence" and is merely a normalization factor; if we do not care about the exact value of the posterior P(θ|x) but only which θ maximizes it, the denominator can be dropped. In Bayesian probability theory, if the posterior P(θ|x) and the prior p(θ) follow the same distribution family, the prior and posterior are called conjugate distributions, and the prior is called the conjugate prior of the likelihood function.

Dirichlet distribution. Before studying the Dirichlet distribution, first review the maximum likelihood estimate for the binomial distribution: in a coin-tossing experiment of N independent tosses, n land heads and N−n land tails; assume the probability of heads is p (a worked version of this derivation is sketched below)
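Since the excerpt is cut off here, the two formulas it refers to are reproduced below as a standard worked sketch (not part of the original notes): the proportional form of Bayes' rule with the evidence dropped, and the binomial maximum likelihood estimate.

    P(\theta \mid x) \;=\; \frac{P(x \mid \theta)\, P(\theta)}{P(x)} \;\propto\; P(x \mid \theta)\, P(\theta)

    L(p) \;=\; \binom{N}{n} p^{\,n} (1-p)^{\,N-n}, \qquad
    \frac{d}{dp}\log L(p) \;=\; \frac{n}{p} - \frac{N-n}{1-p} \;=\; 0
    \;\;\Rightarrow\;\; \hat{p} \;=\; \frac{n}{N}

The Beta distribution is the conjugate prior of this binomial likelihood, and the Dirichlet distribution generalizes the Beta to the multinomial case, which is the connection the notes build on.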

Google Cloud Dataproc configuration issues

浪子不回头ぞ Submitted on 2019-12-01 01:48:57
Question: I've been encountering various issues in some Spark LDA topic modeling I've been running (mainly disassociation errors at seemingly random intervals), which I think mainly have to do with insufficient memory allocation on my executors. This seems to be related to problematic automatic cluster configuration. My latest attempt uses n1-standard-8 machines (8 cores, 30GB RAM) for both the master and worker nodes (6 workers, so 48 total cores). But when I look at /etc/spark/conf/spark
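One common workaround is to size the executors explicitly instead of relying on Dataproc's automatic configuration. A minimal PySpark sketch follows; the property names are standard Spark settings, but the specific values are illustrative assumptions for an n1-standard-8 worker, not taken from the question:

    # Hypothetical explicit executor sizing for 8-vCPU / 30 GB workers.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("lda-topic-modeling")
        .config("spark.executor.cores", "4")            # two executors per worker
        .config("spark.executor.memory", "10g")         # heap per executor
        .config("spark.executor.memoryOverhead", "2g")  # off-heap overhead per executor
        .getOrCreate()
    )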

LDA

北战南征 Submitted on 2019-11-30 16:17:35
In the summary of Principal Component Analysis (PCA), we reviewed the dimensionality-reduction algorithm PCA. Here we summarize another classic dimensionality-reduction method, Linear Discriminant Analysis (LDA). LDA is used very widely in pattern recognition (for example face recognition, ship recognition, and other image-recognition problems), so it is worth understanding how the algorithm works.

Before studying this LDA, it is necessary to distinguish it from the LDA of natural language processing: there, LDA stands for Latent Dirichlet Allocation, a topic model for documents. This article discusses only linear discriminant analysis, so every later mention of LDA refers to linear discriminant analysis.

1. The idea behind LDA

LDA is a supervised dimensionality-reduction technique: every sample in its dataset has a class label. This is where it differs from PCA, which is an unsupervised technique that ignores class labels. The idea of LDA can be summed up in one sentence: "after projection, within-class variance is as small as possible and between-class variance is as large as possible." In other words, we project the data into a lower-dimensional space so that the projected points of each class end up as close together as possible, while the centers of different classes end up as far apart as possible.

This may still feel abstract, so consider the simplest case. Suppose we have two classes of data, red and blue; as shown in the figure in the original post, the features are two-dimensional, and we want to project the data onto a one-dimensional line so that the projections of each class are as close together as possible
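For reference, here is the standard two-class formulation of this "small within-class, large between-class" objective, which the excerpt does not reach before it is cut off: with class means μ_1, μ_2, between-class scatter S_b, and within-class scatter S_w, LDA chooses the projection direction w maximizing the Rayleigh quotient

    J(w) \;=\; \frac{w^{T} S_b\, w}{w^{T} S_w\, w},
    \qquad S_b = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^{T},
    \qquad S_w = \Sigma_1 + \Sigma_2,

where Σ_1 and Σ_2 are the within-class scatter matrices of the two classes; the maximizer is w ∝ S_w^{-1}(μ_1 − μ_2).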

How to generate word clouds from LDA models in Python?

北慕城南 Submitted on 2019-11-30 10:39:16
I am doing some topic modeling on newspaper articles, and have implemented LDA using gensim in Python 3. Now I want to create a word cloud for each topic, using the top 20 words for each topic. I know I can print the words and save the LDA model, but is there any way to just save the top words for each topic, which I can then use for generating word clouds? I tried to Google it, but could not find anything relevant. Any help is appreciated.

Answer (Kenneth Orton): You can get the topn words from an LDA model using Gensim's built-in method show_topic.

lda = models.LdaModel.load('lda.model')
for i in
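The answer is cut off above; as a hedged completion, here is one way the show_topic output can be fed into the wordcloud package. The model path and the per-topic loop are assumptions for illustration, not the original answer's code:

    # Build one word cloud per topic from a saved gensim LDA model.
    from gensim import models
    from wordcloud import WordCloud

    lda = models.LdaModel.load('lda.model')   # placeholder path

    for topic_id in range(lda.num_topics):
        # show_topic returns (word, probability) pairs for the topic
        top_words = dict(lda.show_topic(topic_id, topn=20))
        wc = WordCloud(background_color='white').generate_from_frequencies(top_words)
        wc.to_file('topic_{}.png'.format(topic_id))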

How do you initialize a gensim corpus variable with a csr_matrix?

大城市里の小女人 Submitted on 2019-11-30 07:27:42
I have X as a csr_matrix that I obtained using scikit's tfidf vectorizer, and y, which is an array. My plan is to create features using LDA; however, I could not find how to initialize a gensim corpus variable with X as a csr_matrix. In other words, I don't want to download a corpus as shown in gensim's documentation, nor convert X to a dense matrix, since that would consume a lot of memory and the computer could hang. In short, my questions are the following: How do you initialize a gensim corpus given that I have a csr_matrix (sparse) representing the whole corpus? How do you use LDA to extract
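gensim can wrap a scipy sparse matrix directly via matutils.Sparse2Corpus. A minimal sketch follows; the raw_documents variable and the topic count are illustrative assumptions:

    from gensim import matutils, models
    from sklearn.feature_extraction.text import TfidfVectorizer

    vec = TfidfVectorizer()
    X = vec.fit_transform(raw_documents)   # csr_matrix with documents in rows

    # Sparse2Corpus assumes documents are in columns by default,
    # so flip the flag for a scikit-learn matrix (documents in rows).
    corpus = matutils.Sparse2Corpus(X, documents_columns=False)

    # Map column indices back to terms so the topics stay interpretable.
    id2word = {idx: term for term, idx in vec.vocabulary_.items()}
    lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=50)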

Topic Modeling: How do I use my fitted LDA model to predict new topics for a new dataset in R?

烈酒焚心 Submitted on 2019-11-30 05:26:23
I am using the 'lda' package in R for topic modeling. I want to predict new topics (collections of related words in a document) for a new dataset using a fitted Latent Dirichlet Allocation (LDA) model. In the process, I came across the predictive.distribution() function, but it takes document_sums as an input parameter, which is an output produced after fitting a new model. I need help understanding how to use an existing model on a new dataset and predict its topics. Here is the example code from the package documentation written by Jonathan Chang: #Fit a model
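The question is about the R 'lda' package, but for comparison, inferring topic proportions for unseen documents with an already-fitted model looks like this in Python with gensim. This is an analogue for illustration only, not the R API the asker is using, and the file names are placeholders:

    from gensim import corpora, models

    # Assumed to exist: a fitted model and the dictionary it was trained with.
    dictionary = corpora.Dictionary.load('dictionary.dict')
    lda = models.LdaModel.load('model.lda')

    new_doc = "tokens of an unseen document".split()
    new_bow = dictionary.doc2bow(new_doc)

    # Topic distribution of the unseen document as (topic_id, probability) pairs.
    print(lda.get_document_topics(new_bow))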

Python Gensim: how to calculate document similarity using the LDA model?

自闭症网瘾萝莉.ら Submitted on 2019-11-29 20:27:48
I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on. After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a hint? Thanks!

Don't know if this'll help, but I managed to attain successful results on document matching and similarities when using the actual document as a query.

dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("model.lda")  # result from running online lda (training)
index
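The answer is cut off at the index construction. A hedged completion based on the usual gensim pattern (the query tokens and file names are assumptions): build a similarity index over the topic-space corpus and query it with a bag-of-words document transformed by the model.

    from gensim import corpora, models, similarities

    dictionary = corpora.Dictionary.load('dictionary.dict')
    corpus = corpora.MmCorpus('corpus.mm')
    lda = models.LdaModel.load('model.lda')

    # Cosine-similarity index over every document's topic distribution.
    index = similarities.MatrixSimilarity(lda[corpus])

    # Query with one document (here a toy token list) projected into topic space.
    query_bow = dictionary.doc2bow("some document tokens to compare".split())
    sims = index[lda[query_bow]]          # similarity to every corpus document
    print(sorted(enumerate(sims), key=lambda x: -x[1])[:10])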


Assessing/Improving prediction with linear discriminant analysis or logistic regression

℡╲_俬逩灬. Submitted on 2019-11-29 15:18:03
Question: I recently needed to combine two or more variables in a data set to evaluate whether their combination could enhance predictivity, so I ran some logistic regressions in R. Now, on the statistics Q&A site, someone suggested that I could use linear discriminant analysis. Since I don't have fitcdiscr.m in MATLAB, I'd rather go with lda in R, but I cannot use the fit results to predict AUC or whatever else I could use. Indeed, I see that the fit output of lda in R is some sort of vector with
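The question concerns R's lda (and MATLAB's fitcdiscr), but the general recipe it is reaching for (fit a discriminant model, take the predicted class probabilities, score them with AUC) can be illustrated with scikit-learn. This is a Python analogue on assumed synthetic data, not the asker's R workflow:

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic two-class data standing in for the asker's variables.
    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    clf = LinearDiscriminantAnalysis().fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]      # probability of the positive class
    print("AUC:", roc_auc_score(y_test, scores))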

gensim LdaMulticore not multiprocessing?

江枫思渺然 Submitted on 2019-11-29 10:45:37
When I run gensim's LdaMulticore model on a machine with 12 cores, using:

lda = LdaMulticore(corpus, num_topics=64, workers=10)

I get a logging message that says

using serial LDA version on this node

A few lines later, I see another logging message that says

training LDA model using 10 processes

When I run top, I see 11 python processes have been spawned, but 9 are sleeping, i.e. only one worker is active. The machine has 24 cores, and is not overwhelmed by any means. Why isn't LdaMulticore running in parallel mode?

First, make sure you have installed a fast BLAS library, because most of the
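A small sketch of the two checks this answer points toward: confirm which BLAS numpy is linked against, and pass an explicit workers count to LdaMulticore. The toy corpus below is an assumption so the call is runnable:

    import numpy as np
    from gensim import corpora
    from gensim.models import LdaMulticore

    # 1) An unoptimized reference BLAS is the usual reason workers sit idle;
    #    check whether numpy is built against OpenBLAS/MKL/ATLAS.
    np.show_config()

    # 2) Toy corpus so the call below runs; replace with the real corpus.
    texts = [["human", "machine", "interface"], ["graph", "trees", "minors"]] * 100
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    # workers is the number of extra worker processes; the parent process
    # still does dispatching and merging, so keep it below the core count.
    lda = LdaMulticore(corpus, id2word=dictionary, num_topics=8, workers=3)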