Merging CountVectorizer in Scikit-Learn feature extraction

Question

I am new to scikit-learn and need some help with something I have been working on.

I am trying to classify two types of documents (say, type A and type B) using Multinomial Naive Bayes classification. In order to get the term counts for these documents, I am using the CountVectorizer class in sklearn.feature_extraction.text.

The problem is that the two types of documents require different regular expressions to extract tokens (the token_pattern parameter of CountVectorizer). I can't find a way to fit one vectorizer on the type A training documents and another on the type B training documents, and then combine them. Is it possible to do something like:

vecA = CountVectorizer(token_pattern="[a-zA-Z]+", ...)
vecA.fit(list_of_type_A_document_content)
...
vecB = CountVectorizer(token_pattern="[a-zA-Z0-9]+", ...)
vecB.fit(list_of_type_B_document_content)
...
# Somehow merge the two vectorizers results and get the final sparse matrix

Answer 1:

You can try:

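from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Fit each vectorizer on its own document type
vecA = CountVectorizer(token_pattern="[a-zA-Z]+", ...)
vecA.fit(list_of_type_A_document_content)
vecB = CountVectorizer(token_pattern="[a-zA-Z0-9]+", ...)
vecB.fit(list_of_type_B_document_content)
# Concatenate both feature sets column-wise; calling combined_features.fit(...) would refit both vectorizers on the same data
combined_features = FeatureUnion([('CountVectorizer', vecA), ('CountVect', vecB)])
combined_features.transform(test_data)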

You can read more about FeatureUnion, which is available from version 0.13.1, at http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
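For comparison, here is a minimal sketch of the same idea without FeatureUnion: each vectorizer is fitted on its own document type, and the two count matrices are then stacked column-wise with scipy.sparse.hstack. This swaps in a different technique from the one in the answer. The toy corpora, the labels, and the helper name vectorize are all hypothetical and only serve to make the example runnable.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpora standing in for the two document types.
docs_a = ["alpha beta beta", "beta gamma"]
docs_b = ["x1 y2 x1", "y2 z3"]

# One vectorizer per document type, each with its own token pattern.
vec_a = CountVectorizer(token_pattern=r"[a-zA-Z]+")
vec_b = CountVectorizer(token_pattern=r"[a-zA-Z0-9]+")
vec_a.fit(docs_a)  # vocabulary learned from type A documents only
vec_b.fit(docs_b)  # vocabulary learned from type B documents only


def vectorize(docs):
    """Hypothetical helper: apply both fitted vectorizers and join the columns."""
    return hstack([vec_a.transform(docs), vec_b.transform(docs)]).tocsr()


# Every document gets columns from both vocabularies.
all_docs = docs_a + docs_b
labels = ["A", "A", "B", "B"]
X = vectorize(all_docs)

# Train the classifier on the merged sparse matrix.
clf = MultinomialNB().fit(X, labels)
prediction = clf.predict(vectorize(["beta beta"]))
```

The merged matrix has one row per document and one column per term from either vocabulary. New documents must go through the same vectorize step before prediction so that the column layout matches the training matrix.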

Source: https://stackoverflow.com/questions/37081597/merging-countvectorizer-in-scikit-learn-feature-extraction

Tags

python

scikit-learn

feature-extraction