I have a DataFrame that can be simplified to this:
import pandas as pd
df = pd.DataFrame([{
    'title': 'batman',
    'text': 'man bat man bat',
    'url': \
Take a look at the following link: http://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html
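from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one item (e.g. a DataFrame column) by key."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        # Stateless transformer: there is nothing to learn.
        return self

    def transform(self, data_dict):
        return data_dict[self.key]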
The key argument accepts a pandas DataFrame column label. In your pipeline it can be used like this:
('tfidf_word', Pipeline([
    ('selector', ItemSelector(key='column_name')),
    ('tfidf', TfidfVectorizer()),
]))
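To show how that step fits into something runnable, here is a minimal sketch that combines two such pipelines in a FeatureUnion. The two-row DataFrame, the step names (tfidf_title, tfidf_text) and the choice of the title and text columns are assumptions for illustration only, not taken from the original question.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one item (e.g. a DataFrame column) by key."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


# Hypothetical toy data standing in for the real DataFrame.
df = pd.DataFrame([
    {'title': 'batman', 'text': 'man bat man bat'},
    {'title': 'spiderman', 'text': 'spider man spider'},
])

# One sub-pipeline per text column; FeatureUnion stacks the
# resulting TF-IDF matrices side by side.
features = FeatureUnion([
    ('tfidf_title', Pipeline([
        ('selector', ItemSelector(key='title')),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('tfidf_text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('tfidf', TfidfVectorizer()),
    ])),
])

X = features.fit_transform(df)
print(X.shape)  # one row per DataFrame row, one column per vocabulary term
```

Each selector hands a single column (a Series of strings) to its own vectorizer, so every column gets a separate vocabulary. The combined matrix X can then be passed to a classifier as the last step of an outer Pipeline.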