How do I properly combine numerical features with text (bag of words) in scikit-learn?
问题 I am writing a classifier for web pages, so I have a mixture of numerical features, and I also want to classify the text. I am using the bag-of-words approach to transform the text into a (large) numerical vector. The code ends up being like this: from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer import numpy as np numerical_features = [ [1, 0], [1, 1], [0, 0], [0, 1] ] corpus = [ 'This is the first document.', 'This is