Mallet topic modelling

Posted by 白昼怎懂夜的黑 on 2019-12-22 05:26:18

Question


I have been using Mallet to infer topics for a text file containing 100,000 lines (around 34 MB in Mallet format). Now I need to run it on a file containing a million lines (around 180 MB), and I am getting a java.lang.OutOfMemoryError. Is there a way to split the file into smaller ones and build a single model from the data in all the files combined? Thanks in advance.


Answer 1:


In bin/mallet.bat, increase the value on this line:

set MALLET_MEMORY=1G



Answer 2:


I'm not sure how well Mallet scales to big data, but the project at http://dragon.ischool.drexel.edu/ can keep its data in disk-backed persistent storage, so it can scale to arbitrarily large corpora (at the cost of lower performance, of course).




Answer 3:


The model is still going to be huge, even if it is read from multiple files. Have you tried increasing the heap size of your Java VM?




Answer 4:


A java.lang.OutOfMemoryError occurs mainly because of insufficient heap space. You can use the JVM options -Xms (initial heap size) and -Xmx (maximum heap size) to give the JVM a larger heap so that the error does not recur.
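
To confirm that a heap setting actually reaches the JVM, a small check like the following can help. This is only a sketch: the class name HeapCheck and its helper methods are made up for illustration and are not part of Mallet. It relies only on the standard java.lang.Runtime API.

```java
// Hypothetical helper (not part of Mallet): prints the heap limits the JVM is running with.
class HeapCheck {

    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Maximum amount of memory the JVM will attempt to use, in MiB (governed by -Xmx). */
    static long maxHeapMiB() {
        return toMiB(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory(): upper bound on the heap; totalMemory(): heap currently reserved;
        // freeMemory(): unused part of the currently reserved heap.
        System.out.println("Max heap (MiB):   " + maxHeapMiB());
        System.out.println("Total heap (MiB): " + toMiB(rt.totalMemory()));
        System.out.println("Free heap (MiB):  " + toMiB(rt.freeMemory()));
    }
}
```

Assuming Java 11 or later, this can be run directly from source, for example with `java -Xmx2g HeapCheck.java`. The reported maximum is usually close to, but not exactly, the value passed to -Xmx. If the number does not change when the option changes, the option is probably not being passed to the JVM that runs Mallet.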




Answer 5:


Given the amount of memory in current PCs, it should be easy to use a heap as large as 2 GB. You should try the single-machine solution before considering a cluster.



Source: https://stackoverflow.com/questions/5168342/mallet-topic-modelling
