Choosing an sklearn pipeline for classifying user text data

Submitted by 戏子无情 on 2021-02-19 08:15:52

Question


I'm working on a machine learning application in Python (using the sklearn module), and am currently trying to decide on a model for performing inference. A brief description of the problem:

Given many instances of user data, I'm trying to classify them into various categories based on relative keyword containment. It is a supervised problem: I have many, many instances of pre-classified data. (Each piece of data is between 2 and 12 or so words.)

I am currently trying to decide between two potential models (sketched as pipelines after this list):

  1. CountVectorizer + Multinomial Naive Bayes. Use sklearn's CountVectorizer to obtain keyword counts across the training data, then classify using sklearn's MultinomialNB model.

  2. Tf-idf term weighting on keyword counts + standard Naive Bayes. Obtain a keyword count matrix for the training data using CountVectorizer, transform it to tf-idf weights using sklearn's TfidfTransformer, and then feed that into a standard Naive Bayes model.
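As a minimal sketch (assuming MultinomialNB as the Naive Bayes step in both, since that is the count-based variant), the two options wire up as sklearn pipelines like this:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Option 1: raw keyword counts fed straight into multinomial Naive Bayes
nb_counts = Pipeline(steps=[
  ('cvect', CountVectorizer()),
  ('mnb', MultinomialNB())
  ])

# Option 2: the same counts, reweighted with tf-idf before the Naive Bayes step
nb_tfidf = Pipeline(steps=[
  ('cvect', CountVectorizer()),
  ('tfidf', TfidfTransformer()),
  ('mnb', MultinomialNB())
  ])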

I've read through the documentation for the classes used in both methods, and both seem to address my problem very well.

Are there any obvious reasons for why tf-idf weighting with a standard Naive Bayes model might outperform a multinomial Naive Bayes for this type of problem? Are there any glaring issues with either approach?


Answer 1:


"Standard" Naive Bayes on count features and MultinomialNB are the same algorithm, so the difference between your two options comes entirely from the tf-idf transformation, which penalises words that occur in many documents in your corpus.

My advice: use tf-idf, and tune the sublinear_tf, binary, and norm parameters of TfidfVectorizer for the features.
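As a rough sketch of that tuning (the grid values here are just illustrative starting points, not recommendations):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline(steps=[
  ('tfidf', TfidfVectorizer()),
  ('mnb', MultinomialNB())
  ])

# Illustrative grid over the three TfidfVectorizer parameters mentioned above
parameters = {'tfidf__sublinear_tf': [True, False],  # use 1 + log(tf) instead of raw term frequency
              'tfidf__binary': [True, False],        # term presence/absence instead of counts
              'tfidf__norm': ['l1', 'l2', None]}     # per-document normalization

grid = GridSearchCV(pipeline, parameters, scoring='accuracy', cv=2)
# grid.fit(X, y) with your texts and labels, then inspect grid.best_params_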

Also try all kinds of classifiers available in scikit-learn, which I suspect will give you better results if you properly tune the regularization type (penalty, either 'l1' or 'l2') and the regularization strength (alpha).

If you tune them properly, I suspect you can get much better results using SGDClassifier with log loss (logistic regression) or hinge loss (a linear SVM).

The usual way to tune these parameters is with the GridSearchCV class in scikit-learn.
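For example, a minimal sketch of the SGDClassifier approach tuned with GridSearchCV (note that the 'log' loss was renamed 'log_loss' in scikit-learn 1.1; the grid values are only illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline(steps=[
  ('tfidf', TfidfVectorizer()),
  ('sgd', SGDClassifier(max_iter=1000, tol=1e-3))
  ])

parameters = {'sgd__loss': ['log_loss', 'hinge'],  # logistic regression vs. linear SVM
              'sgd__penalty': ['l1', 'l2'],        # regularization type
              'sgd__alpha': [1e-5, 1e-4, 1e-3]}    # regularization strength

grid = GridSearchCV(pipeline, parameters, scoring='accuracy', cv=2)
# grid.fit(X, y) with your texts and labels, then inspect grid.best_params_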




Answer 2:


I agree with David's comment: you'll want to train several different models and see which one performs best.

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # was sklearn.grid_search in old scikit-learn versions

from pprint import pprint

df = pd.DataFrame({'Keyword': ['buy widget', 'buy widgets', 'fiberglass widget',
                               'fiberglass widgets', 'how much are widget',
                               'how much are widgets', 'installing widget',
                               'installing widgets', 'vinyl widget', 'vinyl widgets',
                               'widget cost', 'widget estimate', 'widget install',
                               'widget installation', 'widget price', 'widget pricing',
                               'widgets cost', 'widgets estimate', 'widgets install',
                               'widgets installation', 'widgets price', 'widgets pricing',
                               'wood widget', 'wood widgets'],
                   'Label': ['Buy', 'Buy', 'Fiberglass', 'Fiberglass', 'Cost', 'Cost',
                             'Install', 'Install', 'Vinyl', 'Vinyl', 'Cost', 'Estimate',
                             'Install', 'Install', 'Cost', 'Cost', 'Cost', 'Estimate',
                             'Install', 'Install', 'Cost', 'Cost', 'Wood', 'Wood']},
                  columns=['Label', 'Keyword'])

X = df['Keyword']
y = df['Label']

# Alternative pipeline for option 1 (raw counts + multinomial Naive Bayes):
# pipeline = Pipeline(steps=[
#   ('cvect', CountVectorizer()),
#   ('mnb', MultinomialNB())
#   ])

# Option 2: tf-idf features + Bernoulli Naive Bayes
pipeline = Pipeline(steps=[
  ('tfidf', TfidfVectorizer()),
  ('bnb', BernoulliNB())
  ])

parameters = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'tfidf__stop_words': [None, 'english'],
              'tfidf__use_idf': [True, False],
              'bnb__alpha': [0.01, 0.5, 1.0],  # alpha=0 is numerically unstable, so use a small value instead
              'bnb__binarize': [None, 0.2, 0.5, 0.7, 1.0],  # thresholds for binarizing the tf-idf features
              'bnb__fit_prior': [True, False]}

grid = GridSearchCV(pipeline, parameters, scoring='accuracy', cv=2, verbose=1)
grid.fit(X, y)

print('Best score:', grid.best_score_)
print('Best parameters:')
pprint(grid.best_params_, indent=2)

# Here's how to predict (uncomment):
# pred = grid.predict(['buy wood widget', 'how much is a widget'])
# print(pred)


Source: https://stackoverflow.com/questions/34735016/choosing-an-sklearn-pipeline-for-classifying-user-text-data
