NDCG as scoring function with GridSearchCV and stratified data?

Question

I'm working on a learning-to-rank task. The dataset has a column thread_id, which is a group label (stratified data). In the evaluation phase I must take these groups into account, because my scoring function (e.g. nDCG) is computed per thread.
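
To make "per thread" concrete, here is a minimal pure-Python sketch of a grouped nDCG. The function names, the linear gain (relevance / log2(rank + 1)), and the choice to skip threads that contain no relevant item are assumptions made for illustration, not something prescribed by scikit-learn.

```python
import math
from collections import defaultdict


def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_per_group(y_true, y_score, groups):
    """Mean nDCG over groups; groups whose ideal DCG is 0 are skipped (assumption)."""
    by_group = defaultdict(list)
    for rel, score, group in zip(y_true, y_score, groups):
        by_group[group].append((score, rel))

    values = []
    for items in by_group.values():
        # Rank the items of this group by predicted score, highest first.
        ranked = [rel for _, rel in sorted(items, key=lambda p: p[0], reverse=True)]
        ideal_dcg = dcg(sorted(ranked, reverse=True))
        if ideal_dcg > 0:
            values.append(dcg(ranked) / ideal_dcg)
    return sum(values) / len(values) if values else 0.0
```

Each thread is ranked and normalised on its own, and only then are the per-thread values averaged, which is why the scorer needs the group label of every row.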

Now, if I implement nDCG with the signature scorer(estimator, X, y), I can easily pass it to GridSearchCV as the scoring function, as in the example below:

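from sklearn.model_selection import GridSearchCV, GroupShuffleSplit

def my_nDCG(estimator, X, y):
    # group the rows by X['thread_id']
    # compute nDCG per thread and aggregate the result
    return result

# split by thread so that no thread ends up in both train and test
splitter = GroupShuffleSplit(...).split(X, groups=X['thread_id'])
# param_grid (the hyperparameter grid to search) is defined elsewhere
cv = GridSearchCV(clf, param_grid, cv=splitter, scoring=my_nDCG)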

GridSearchCV selects the model by calling my_nDCG(). Unfortunately, inside my_nDCG, X doesn't have the thread_id column, because it must be dropped before passing X to fit(); otherwise I'd train the model using thread_id as a feature.

cv.fit(X.drop('thread_id', axis=1), y)

How can I do this without the terrible workaround of keeping thread_id aside as a global and merging it back into X inside my_nDCG()?

Is there any other way to use nDCG with scikit-learn? I see scikit-learn supports stratified data, but proper support seems to be missing when it comes to model evaluation with stratified data.

Edit

I just noticed that GridSearchCV.fit() accepts a groups parameter; in my case it would still be X['thread_id']. At this point I only need to read that parameter within my custom scoring function. How can I do that?
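
One possible direction, sketched under assumptions rather than offered as an official scikit-learn mechanism: since a custom scorer only receives (estimator, X, y), the group labels can be captured in a closure and looked up through the row labels of X. The names make_group_aware_scorer, group_of and grouped_metric are hypothetical, and the sketch assumes that X is a pandas DataFrame with a unique index and that the estimator exposes predict().

```python
def make_group_aware_scorer(group_of, grouped_metric):
    """Build scorer(estimator, X, y) that recovers each row's group from X.index.

    group_of: mapping from a row label of X to its group id (hypothetical name).
    grouped_metric: callable(y_true, y_score, groups) -> float (hypothetical name).
    """
    def scorer(estimator, X, y):
        # Row labels survive positional slicing of a DataFrame, so the group of
        # every test row can be looked up without a thread_id column in X.
        groups = [group_of[label] for label in X.index]
        y_score = list(estimator.predict(X))
        return grouped_metric(list(y), y_score, groups)

    return scorer
```

With the objects from the question, this could be wired up as scoring=make_group_aware_scorer(X['thread_id'].to_dict(), some_grouped_ndcg), followed by cv.fit(X.drop('thread_id', axis=1), y). This relies on GridSearchCV handing the scorer DataFrame slices that keep their original index labels; if X were converted to a NumPy array first, the lookup would no longer work.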

Source: https://stackoverflow.com/questions/43501442/ndcg-as-scoring-function-with-gridsearchcv-and-stratified-data

Tags

python

scikit-learn

grid-search
