LDA model prediction inconsistency

Submitted by 冷暖自知 on 2019-12-24 11:37:12

Question


I trained an LDA model and loaded it into the environment to transform new data:

from pyspark.ml.clustering import LocalLDAModel

lda = LocalLDAModel.load(path)
df = lda.transform(text)

The model adds a new column called topicDistribution. In my opinion, this distribution should be the same for the same input; otherwise the model is not consistent. However, in practice it is not.

What is the reason for this, and how can I fix it?


Answer 1:


LDA uses randomness when training and, depending on the implementation, when inferring on new data. The implementation in Spark is based on EM MAP inference, so I believe it only uses randomness when training the model. This means the results can differ each time the algorithm is trained and run.

To get the same results when running on the same input with the same parameters, you can set the random seed when training the model. For example, to set the random seed to 1 (using the pyspark.mllib API):

model = LDA.train(data, k=2, seed=1)
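
Since the question uses the pyspark.ml API (LocalLDAModel) rather than pyspark.mllib, a rough equivalent of seeded training there might look like the sketch below. This assumes a hypothetical DataFrame data with a vector "features" column and a hypothetical save path:

from pyspark.ml.clustering import LDA

# Assumes `data` is a DataFrame with a "features" vector column (hypothetical).
lda = LDA(k=2, seed=1, optimizer="online")
model = lda.fit(data)   # the "online" optimizer yields a LocalLDAModel
model.save(path)        # can later be reloaded with LocalLDAModel.load(path)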

To set the seed when transforming new data, create a parameter map that overrides the default value (None) for seed:

lda = LocalLDAModel.load(path)
paramMap = {lda.seed: 1}           # override the model's seed param for this call
df = lda.transform(text, paramMap)

For more information about overriding model parameters, see the Spark ML Pipelines documentation on Parameters and ParamMaps.
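
As a quick sanity check (reusing the same hypothetical text DataFrame and paramMap from above), transforming twice with the seed fixed should now produce identical topicDistribution columns:

df1 = lda.transform(text, paramMap)
df2 = lda.transform(text, paramMap)
# Both runs should show the same topic distribution for each document.
df1.select("topicDistribution").show(truncate=False)
df2.select("topicDistribution").show(truncate=False)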



Source: https://stackoverflow.com/questions/47784718/lda-model-prediction-nonconsistance
