Question
When training a model, there is an option called limitFeatures. When I set it to, say, 100, ColumnDataClassifier uses only the top 100 features. However, it still serializes all the features to model.ser.gz. When I deserialize this file in my Java code, my program uses approximately 500 MB of memory. Is there a way to create smaller models that contain only the selected features?
I am using the tool from the command line, but a Java solution is very welcome as well. Here is the relevant part of the prop file:
useClassFeature=false
1.useSplitWordNGrams=true
1.useSplitWords=true
1.useNGrams=false
1.usePrefixSuffixNGrams=false
1.splitWordsRegexp=\\s+
1.maxWordNGramLeng=5
1.minWordNGramLeng=2
1.binnedLengths=10,20,30,50,75,100,200,300,500
1.useLowercaseSplitWords=true
1.useAdaptL1=true
1.limitFeatures=500
1.l1reg=5.0
featureMinimumSupport=5
featureWeightThreshold=10
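
To make the goal concrete, here is a minimal sketch of the effect I am after, written in plain Java and independent of Stanford NLP. It assumes the model can be viewed as a map from feature name to weight, which is a simplification. The class TopFeaturePruner and its methods topK and gzipSerialize are hypothetical names made up for this illustration; they are not part of ColumnDataClassifier or any Stanford API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only: not Stanford NLP code.
final class TopFeaturePruner {

    // Keep the k entries with the largest absolute weight.
    static Map<String, Double> topK(Map<String, Double> weights, int k) {
        Map<String, Double> kept = new LinkedHashMap<>();
        weights.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, Double> e) -> Math.abs(e.getValue())).reversed())
                .limit(k)
                .forEach(e -> kept.put(e.getKey(), e.getValue()));
        return kept;
    }

    // Serialize an object with Java serialization and gzip, like a .ser.gz file.
    static byte[] gzipSerialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up weights standing in for a large trained feature set.
        HashMap<String, Double> all = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            all.put("feature-" + i, (double) (i % 1000));
        }
        // Copy into a HashMap so the pruned map is Serializable.
        HashMap<String, Double> pruned = new HashMap<>(topK(all, 500));

        System.out.println("all features:     " + gzipSerialize(all).length + " bytes");
        System.out.println("top 500 features: " + gzipSerialize(pruned).length + " bytes");
    }
}
```

In other words, I would like the serialized model to contain only the pruned set, not the full one. Whether ColumnDataClassifier can do this itself is exactly what I am asking.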
Source: https://stackoverflow.com/questions/40685303/stanford-nlp-columndataclassifier-how-to-serialize-model-with-only-top-features