Question
How would you merge a scikit-learn classifier that operates over a bag-of-words with one that operates on arbitrary numeric fields?
I know that these are basically the same thing behind the scenes, but I'm having trouble figuring out how to do this via the existing library methods. For example, my bag-of-words classifier uses the pipeline:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    # non_negative=True became alternate_sign=False in newer scikit-learn
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
classifier.fit(['some random text', 'some other text', ...], [CLS_A, CLS_B, ...])
Whereas my other usage is like:
classifier = LinearSVC()
# fit expects one row per sample, so each numeric value is its own list
classifier.fit([[1.23], [4.23], ...], [CLS_A, CLS_B, ...])
How would I construct a LinearSVC classifier that could be trained using both sets of data simultaneously? e.g.
classifier = ?
classifier.fit([('some random text',1.23),('some other text',4.23), ...], [CLS_A, CLS_B, ...])
Answer 1:
The easy way:
import scipy.sparse
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

tfidf = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
])
X_tfidf = tfidf.fit_transform(texts)
X_other = load_your_other_features()
# stack the tf-idf matrix and the extra features side by side
X = scipy.sparse.hstack([X_tfidf, X_other])
clf = LinearSVC().fit(X, y)
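Note that scipy.sparse.hstack needs every block to be two-dimensional with the same number of rows. If the extra features start out as a flat list of numbers, as in the question, one way to shape them (a sketch; the variable name is illustrative) is:

import numpy as np

# Flat list of per-sample numeric values from the question,
# reshaped into a single-column matrix so hstack can align rows.
X_other = np.array([1.23, 4.23]).reshape(-1, 1)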
The principled solution, which allows you to keep everything in one Pipeline, would be to wrap hashing, tf-idf and your other feature extraction method in a few simple transformer objects and put these in a FeatureUnion, but it's hard to tell what the code would look like from the information you've given.
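For concreteness, here is one way such a FeatureUnion could look. This is a minimal sketch, not the answer's code: it assumes the input is a list of (text, number) tuples as in the question, the ItemGetter and ToColumn helpers are hypothetical transformers written for this example, and alternate_sign=False is the newer spelling of non_negative=True.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

class ItemGetter(BaseEstimator, TransformerMixin):
    """Pull one field out of each (text, number) sample."""
    def __init__(self, index):
        self.index = index
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [row[self.index] for row in X]

class ToColumn(BaseEstimator, TransformerMixin):
    """Reshape a list of scalars into the 2-D column estimators expect."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array(X, dtype=float).reshape(-1, 1)

classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('get', ItemGetter(0)),
            ('vectorizer', HashingVectorizer(ngram_range=(1, 4),
                                             alternate_sign=False)),
            ('tfidf', TfidfTransformer()),
        ])),
        ('numeric', Pipeline([
            ('get', ItemGetter(1)),
            ('shape', ToColumn()),
        ])),
    ])),
    ('clf', LinearSVC()),
])

classifier.fit([('some random text', 1.23), ('some other text', 4.23)],
               ['CLS_A', 'CLS_B'])

FeatureUnion concatenates the two feature blocks horizontally, using a sparse hstack when any block is sparse, so the LinearSVC at the end sees a single combined matrix.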
(P.S. As I keep saying on SO, on the mailing list and elsewhere, OneVsRestClassifier(LinearSVC()) is useless. LinearSVC does OvR out of the box, so this is just a slower way of fitting an OvR SVM.)
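A quick way to see the built-in OvR behaviour (an illustrative check, not from the answer): after fitting LinearSVC on a multiclass problem, coef_ has one row per class, i.e. one binary separator each.

import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = ['a', 'a', 'b', 'b', 'c', 'c']

clf = LinearSVC().fit(X, y)
print(clf.coef_.shape)  # (3, 1): one one-vs-rest separator per class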
Source: https://stackoverflow.com/questions/20106940/merging-bag-of-words-scikits-classifier-with-arbitrary-numeric-fields