NLTK naive Bayesian classifier memory issue


Question


My first post here! I have problems using the NLTK NaiveBayesClassifier. I have a training set of 7,000 items. Each training item has a description of 2 or 3 words and a code. I would like to use the code as the class label and each word of the description as a feature. An example:

"My name is Obama", 001 ...

Training set = {[feature['My']=True, feature['name']=True, feature['is']=True, feature['Obama']=True], 001}

Unfortunately, with this approach the training procedure NaiveBayesClassifier.train uses up to 3 GB of RAM. What's wrong with my approach? Thank you!

from nltk import classify
from nltk.classify import NaiveBayesClassifier

def document_features(document): # feature extractor
    document = set(document)
    return dict((w, True) for w in document)

...
words = set()
entries = []
train_set = []
train_length = 2000
readfile = open("atcname.pl", 'r')
t = readfile.readline()
while t != "":
    t = t.split("'")
    code = t[0] # class
    desc = t[1] # description
    s = desc.split() # words in the description
    words = words.union(s) # update dictionary with the new words in the description
    entries.append((s, code))
    t = readfile.readline()
train_set = classify.util.apply_features(document_features, entries[:train_length])
classifier = NaiveBayesClassifier.train(train_set) # Training

Answer 1:


Use nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory.

from nltk.classify import apply_features

More information and an example here.
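As a minimal sketch of this (document_features is the extractor from the question; the two sample entries are made up for illustration):

from nltk.classify import apply_features
from nltk.classify import NaiveBayesClassifier

def document_features(document):
    # One boolean feature per distinct word in the description.
    return dict((w, True) for w in set(document))

# entries is a list of (word_list, label) pairs, as built in the question.
entries = [(["My", "name", "is", "Obama"], "001"),
           (["My", "dog", "barks"], "002")]

# apply_features returns a lazy, list-like view: document_features is only
# applied to an item when the classifier asks for it, so the feature dicts
# are never all materialized in memory at once.
train_set = apply_features(document_features, entries)
classifier = NaiveBayesClassifier.train(train_set)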

You are also loading the whole file into memory, so you will need some form of lazy loading that reads data only as it is needed. Consider looking into this.
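A minimal sketch of such lazy loading, assuming the same atcname.pl line format as in the question (code'description) and that NaiveBayesClassifier.train makes only a single pass over its input (which is the case in current NLTK):

from nltk.classify import NaiveBayesClassifier

def document_features(document):
    # One boolean feature per distinct word, as in the question.
    return dict((w, True) for w in set(document))

def lazy_train_set(path):
    # Yield one (featureset, label) pair per line, so neither the raw
    # file contents nor the feature dicts are ever all in memory at once.
    with open(path, 'r') as f:
        for line in f:
            parts = line.split("'")
            code, desc = parts[0], parts[1]
            yield (document_features(desc.split()), code)

classifier = NaiveBayesClassifier.train(lazy_train_set("atcname.pl"))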



Source: https://stackoverflow.com/questions/9723875/nltk-naive-bayesian-classifier-memory-issue
