I have a DataFrame that can be simplified to this:
import pandas as pd
df = pd.DataFrame([{
    'title': 'batman',
    'text': 'man bat man bat',
    'url':
@elphz's answer is a good intro to how you could use FeatureUnion and FunctionTransformer to accomplish this, but I think it could use a little more detail.
First, define the functions you pass to FunctionTransformer so that they accept your input and return it in the shape the next step expects. Here I assume you want to pass in the whole DataFrame and get back a 1-D array of strings from a single column, since that is what the text vectorizers consume. I would therefore pass the DataFrame itself and select the column by name, like so:
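from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
def text(X):
    # Return the 'text' column as a 1-D array of strings
    return X['text'].values
def title(X):
    # Return the 'title' column as a 1-D array of strings
    return X['title'].values
# validate=False lets the DataFrame through without converting it to a numeric array
pipe_text = Pipeline([('col_text', FunctionTransformer(text, validate=False))])
pipe_title = Pipeline([('col_title', FunctionTransformer(title, validate=False))])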
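As a quick sanity check, here is a minimal sketch that runs one such selector pipeline end to end. The two-row `toy` frame, its `label` column, and the name `select_text` are made up for illustration and are not the asker's data. The point is only that the selector hands a 1-D array of strings to the vectorizer, which then produces one row of features per input row.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data, made up for this check
toy = pd.DataFrame({
    'title': ['batman', 'superman'],
    'text': ['man bat man bat', 'super man flies'],
    'label': [0, 1],
})

def select_text(X):
    # Return the 'text' column as a 1-D array of strings
    return X['text'].values

pipe = Pipeline([
    ('col_text', FunctionTransformer(select_text, validate=False)),
    ('tfidf', TfidfVectorizer()),
])

features = pipe.fit_transform(toy)
print(select_text(toy).shape)  # one string per row
print(features.shape)          # rows x vocabulary size
```

If the selector returned a 2-D array instead (for example `X[['text']].values`), the vectorizer would not see one string per document, so keeping the output 1-D matters here.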
Now, to test the variations of transformers and classifiers, I would propose keeping a list of transformers and a list of classifiers and simply iterating through them, much like a grid search:
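from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
tfidf = TfidfVectorizer()
cv = CountVectorizer()
lr = LogisticRegression()
rc = RidgeClassifier()
transformers = [('tfidf', tfidf), ('cv', cv)]
clfs = [lr, rc]
best_est = None
best_score = 0
for name1, tran1 in transformers:
    for name2, tran2 in transformers:
        # Clone so the two branches never share one vectorizer instance
        pipe1 = Pipeline(pipe_text.steps + [(name1, clone(tran1))])
        pipe2 = Pipeline(pipe_title.steps + [(name2, clone(tran2))])
        union = FeatureUnion([('text', pipe1), ('title', pipe2)])
        X = union.fit_transform(df)
        # Fixed seed so every combination is scored on the same split
        X_train, X_test, y_train, y_test = train_test_split(X, df.label, random_state=0)
        for clf in clfs:
            # Fit a fresh copy so the stored best estimator is not refit later
            est = clone(clf).fit(X_train, y_train)
            score = est.score(X_test, y_test)
            if score > best_score:
                best_score = score
                best_est = est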
This is a simple example, but you can see how you could plug in any variety of transformations and classifiers in this way.
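If you would rather let scikit-learn do the iterating, the same idea can be sketched with `GridSearchCV`, since a parameter grid can swap out whole pipeline steps by name. This is a different technique from the manual loop, not a rewrite of it: the vectorizers live inside the cross-validated pipeline, so they are refit on each training fold rather than on all rows before the split. The `toy` frame, the `branch` helper, and the step names `col`, `vec`, `union`, and `clf` are assumptions made up for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data; substitute the real DataFrame
toy = pd.DataFrame({
    'title': ['batman', 'superman', 'batman returns', 'superman two'] * 3,
    'text': ['man bat man bat', 'super man flies',
             'bat cave night', 'super cape sky'] * 3,
    'label': [0, 1, 0, 1] * 3,
})

def text(X):
    return X['text'].values

def title(X):
    return X['title'].values

def branch(selector):
    # Column selector followed by a placeholder vectorizer that the grid replaces
    return Pipeline([
        ('col', FunctionTransformer(selector, validate=False)),
        ('vec', TfidfVectorizer()),
    ])

model = Pipeline([
    ('union', FeatureUnion([('text', branch(text)), ('title', branch(title))])),
    ('clf', LogisticRegression()),
])

# Each key names a step; each value lists the estimators to try in that slot
param_grid = {
    'union__text__vec': [TfidfVectorizer(), CountVectorizer()],
    'union__title__vec': [TfidfVectorizer(), CountVectorizer()],
    'clf': [LogisticRegression(), RidgeClassifier()],
}

search = GridSearchCV(model, param_grid, cv=3)
search.fit(toy, toy['label'])
print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is the full pipeline (selectors, vectorizers, and classifier), so it can be called with `predict` on a new DataFrame directly.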