Question
I'm working on a learning-to-rank task. The dataset has a column thread_id, which is a group label (stratified data).
In the evaluation phase I must take these groups into account, because my scoring function works on a per-thread basis (e.g. nDCG).
If I implement nDCG with the signature scorer(estimator, X, y), I can easily pass it to GridSearchCV as the scoring function, as in the example below:
def my_nDCG(estimator, X, y):
    # group by X['thread_id']
    # compute the result
    return result

splitter = GroupShuffleSplit(...).split(X, groups=X['thread_id'])
# param_grid is the usual dict of hyperparameters to search (required by GridSearchCV)
cv = GridSearchCV(clf, param_grid, cv=splitter, scoring=my_nDCG)
GridSearchCV selects the model by calling my_nDCG().
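To make the "compute the result" step concrete, here is a minimal sketch of a per-thread nDCG averaged over threads. It is only an illustration under assumptions: it uses the linear-gain form of DCG (relevance divided by log2 of rank + 1), scores a thread with no relevant item as 0.0, and the helper names dcg, ndcg and mean_ndcg_by_group are made up for this example.

```python
import math
from collections import defaultdict


def dcg(relevances):
    """DCG of relevances given in ranked order (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(y_true, y_score):
    """nDCG for a single thread; 0.0 if the thread has no relevant item."""
    # Rank items by predicted score, highest first.
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    ranked = [y_true[i] for i in order]
    ideal = dcg(sorted(y_true, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0


def mean_ndcg_by_group(y_true, y_score, groups):
    """Average the per-thread nDCG over all threads present in `groups`."""
    by_group = defaultdict(lambda: ([], []))
    for truth, score, group in zip(y_true, y_score, groups):
        by_group[group][0].append(truth)
        by_group[group][1].append(score)
    values = [ndcg(truth, score) for truth, score in by_group.values()]
    return sum(values) / len(values)
```

The point is that the third argument, groups, has to come from somewhere, and that is exactly what the scorer signature does not provide.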
Unfortunately, inside my_nDCG, X doesn't have the thread_id column, because it must be dropped before X is passed to fit(); otherwise I'd train the model using thread_id as a feature.
cv.fit(X.drop('thread_id', axis=1), y)
How can I do this without the terrible workaround of keeping thread_id aside as a global and merging it back into X inside my_nDCG()?
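For concreteness, the workaround I mean looks roughly like the sketch below. It uses a closure instead of a true global, but the idea is the same. It assumes X is a pandas DataFrame whose index is preserved when the cross-validation splitter slices it, and that estimator.predict() returns a ranking score. The names make_grouped_scorer and metric are hypothetical, not part of scikit-learn.

```python
import numpy as np
import pandas as pd


def make_grouped_scorer(thread_ids, metric):
    """Build a scorer(estimator, X, y) that averages `metric` over threads.

    thread_ids: pd.Series indexed like the full X, kept out of the features.
    metric: callable(y_true, y_score) -> float, e.g. a per-thread nDCG.
    """
    def scorer(estimator, X, y):
        # X is a row subset of the original frame, so its index still tells
        # us which thread every row belongs to.
        frame = pd.DataFrame({
            "group": thread_ids.loc[X.index].to_numpy(),
            "y_true": np.asarray(y),
            # Assumption: predict() yields scores usable for ranking.
            "y_score": np.asarray(estimator.predict(X)),
        })
        per_thread = [
            metric(part["y_true"].tolist(), part["y_score"].tolist())
            for _, part in frame.groupby("group")
        ]
        return sum(per_thread) / len(per_thread)

    return scorer
```

It would be used as scoring=make_grouped_scorer(X['thread_id'], some_per_thread_metric), with thread_id dropped from the features passed to fit(). It works only as long as the row index survives the split, which is why I consider it fragile.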
Is there any other way to use nDCG with scikit-learn? I see that scikit-learn supports stratified data, but proper support seems to be missing when it comes to model evaluation with stratified data.
Edit
I just noticed that GridSearchCV.fit() accepts a groups parameter; in my case it would still be X['thread_id'].
At this point I only need to read that parameter within my custom scoring function. How can I do that?
Source: https://stackoverflow.com/questions/43501442/ndcg-as-scoring-function-with-gridsearchcv-and-stratified-data