Stanford NLP ColumnDataClassifier: How to serialize model with only top features?

Submitted by 早过忘川 on 2019-12-24 06:47:26

Question


When training a model, there is an option called limitFeatures. When I set this option to, say, 100, ColumnDataClassifier uses just the top 100 features. However, it still serializes all the features to model.ser.gz. When I deserialize this file in my Java code, my program uses approximately 500 MB of memory. Is there a way to create smaller models that contain only the selected features?

I am using the tool from the command line, but a solution in Java is very welcome as well. Here is the relevant part of the properties file:

useClassFeature=false
1.useSplitWordNGrams=true
1.useSplitWords=true
1.useNGrams=false
1.usePrefixSuffixNGrams=false
1.splitWordsRegexp=\\s+
1.maxWordNGramLeng=5
1.minWordNGramLeng=2
1.binnedLengths=10,20,30,50,75,100,200,300,500
1.useLowercaseSplitWords=true
1.useAdaptL1=true
1.limitFeatures=500
1.l1reg=5.0
featureMinimumSupport=5
featureWeightThreshold=10
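
To make the memory observation reproducible, here is a minimal sketch of how the footprint could be measured. It relies only on the JDK and assumes that model.ser.gz is a gzip-compressed Java serialization stream (which the file name suggests) and that the Stanford classes are on the classpath when it runs. The class name ModelFootprint and its helper methods are hypothetical, not part of Stanford NLP, and the heap figure is only a rough estimate because Runtime.gc() is just a hint to the JVM.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of Stanford NLP: loads the first object
// from a gzip-compressed serialized file and reports approximate heap growth.
public class ModelFootprint {

    /** Reads the first Java-serialized object from a gzip-compressed file. */
    public static Object loadGzippedObject(Path path)
            throws IOException, ClassNotFoundException {
        try (InputStream in = Files.newInputStream(path);
             ObjectInputStream ois = new ObjectInputStream(new GZIPInputStream(in))) {
            return ois.readObject();
        }
    }

    /** Rough estimate of the heap currently in use, in bytes. */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // only a hint; the JVM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) throws Exception {
        // Assumed default path; pass the real model path as the first argument.
        Path modelPath = Paths.get(args.length > 0 ? args[0] : "model.ser.gz");

        long before = usedHeapBytes();
        Object model = loadGzippedObject(modelPath);
        long after = usedHeapBytes();

        String type = (model == null) ? "null" : model.getClass().getName();
        System.out.printf("Size on disk: %d bytes%n", Files.size(modelPath));
        System.out.printf("Loaded object of type: %s%n", type);
        System.out.printf("Approx. heap growth: %d MB%n",
                (after - before) / (1024 * 1024));
    }
}
```

Comparing the reported heap growth for models trained with and without limitFeatures would show whether the option changes what is actually written to the file.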

Source: https://stackoverflow.com/questions/40685303/stanford-nlp-columndataclassifier-how-to-serialize-model-with-only-top-features
