Using a topic modeling Java toolkit

Question

I'm working on text classification and I want to use topic models (LDA). My corpus consists of at least 24,000 Persian news documents. Each document in the corpus is a set of (keyword, weight) pairs extracted from the news text.

I found two Java toolkits: MALLET and LingPipe. I've read the MALLET tutorial on importing data, and it takes plain text as input, not the format that I have. Is there any way I could change that?

I also read a little about LingPipe; the example in its tutorial uses arrays of integers. Is that convenient for large data?

Which implementation of LDA is better for my case? Are there any other implementations (in Java) that suit my data?

Answer 1:

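From the keyword-weight file you can create an artificial text for each document: repeat each keyword a number of times proportional to its weight, and put the words in random order. LDA treats a document as a bag of words, so only the word counts matter, not the order. Run MALLET on the generated texts to retrieve the topics.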
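A minimal sketch of that conversion, assuming the weights are non-negative numbers and that rounding `weight * scale` to a whole count is acceptable. The class name `PseudoDocument`, the method `build`, the `scale` and `seed` parameters, and the sample keywords (English stand-ins for Persian keywords) are all hypothetical and chosen only for illustration; they are not part of MALLET.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class PseudoDocument {

    // Hypothetical helper: repeats each keyword round(weight * scale) times
    // (at least once), shuffles the tokens, and joins them with spaces.
    static String build(Map<String, Double> weights, double scale, long seed) {
        List<String> tokens = new ArrayList<>();
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            // Keep at least one copy so low-weight keywords are not dropped.
            int copies = (int) Math.max(1, Math.round(entry.getValue() * scale));
            for (int i = 0; i < copies; i++) {
                tokens.add(entry.getKey());
            }
        }
        // Fixed seed so the generated text is reproducible.
        Collections.shuffle(tokens, new Random(seed));
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Illustrative (keyword, weight) pairs for one news document.
        Map<String, Double> doc = new LinkedHashMap<>();
        doc.put("economy", 0.6);
        doc.put("oil", 0.3);
        doc.put("market", 0.01);

        // One instance per line: name, label, then the text.
        System.out.println("doc1 news " + build(doc, 10.0, 42L));
    }
}
```

Writing one such line per document gives a file in the "one instance per line" layout (name, label, text) that MALLET's `import-file` command reads. For Persian text, the default tokenizer pattern may need to be replaced, for example with a Unicode-aware `--token-regex`; treat that as something to verify against your MALLET version.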

Source: https://stackoverflow.com/questions/28585075/using-topic-modeling-java-toolkit

Tags

topic-modeling

mallet

lingpipe
