KNN: give more weight to specific features in the distance

借酒劲吻你 2020-12-22 07:17

I'm using the Kobe Bryant Dataset. I wish to predict the shot_made_flag with KnnRegressor.

I\'ve used game_date to extract year and

2 Answers
  • 2020-12-22 07:42

    Just to add to Shihab's answer regarding the distance computation: you can use scipy's pdist, as suggested in this post, which is faster and more efficient.

    from scipy.spatial.distance import pdist, squareform
    
    # create the custom weight array (one weight per feature)
    weight = ...
    # pairwise distances with the weighted Minkowski norm (p=2)
    distances = pdist(X, metric='minkowski', p=2, w=weight)
    # reformat the condensed result as a square matrix
    distances_as_2d_matrix = squareform(distances)
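
    A small end-to-end sketch of the above (the data and weight values here are made up for illustration, not from the question's dataset):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
weight = np.array([4.0, 1.0])   # first feature counts 4x as much

# weighted Minkowski with p=2: sqrt(sum(w * (x - y)**2))
distances = pdist(X, metric='minkowski', p=2, w=weight)
D = squareform(distances)

print(D[0, 1])  # sqrt(4 * 1**2) = 2.0
print(D[0, 2])  # sqrt(1 * 1**2) = 1.0
```

    Note that squareform turns the condensed 1D result of pdist into a symmetric N x N matrix with zeros on the diagonal, which is the shape KNN expects for a precomputed metric.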
    
  • 2020-12-22 07:51

    First, you have to prepare a 1D numpy weight array, specifying a weight for each feature. You could do something like:

    weight = np.ones((M,))          # M is the number of features
    weight[[1, 7, 10]] = 2          # increase the weight of the features at indices 1, 7 and 10
    weight = weight / weight.sum()  # normalize the weights
    

    You can use kobe_data_encoded.columns to find the indices of the season, year and month features in your dataframe, to replace the second line above.

    Now define a distance function, which per sklearn's guidelines has to take two 1D numpy arrays and return a scalar.

    def my_dist(x,y):
        global weight     #1D array, same shape as x or y
        dist = ((x-y)**2) #1D array, same shape as x or y
        return np.dot(dist,weight)  # a scalar float
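
    As a quick sanity check of this weighted distance (the weight values below are made up):

```python
import numpy as np

weight = np.array([0.5, 0.25, 0.25])  # hypothetical normalized weights

def my_dist(x, y):
    dist = (x - y) ** 2          # 1D array, same shape as x or y
    return np.dot(dist, weight)  # scalar: weighted squared distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 1.0])
print(my_dist(x, y))  # 0.5*1 + 0.25*0 + 0.25*4 = 1.5
```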
    

    And initialize KNeighborsRegressor as:

    knn = KNeighborsRegressor(metric=my_dist)
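
    Putting the pieces together on synthetic data (the data, weights and n_neighbors below are placeholders, not the Kobe dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.random((50, 3))                       # 50 samples, 3 features
y = 2.0 * X[:, 0] + rng.normal(0, 0.01, 50)   # target driven mostly by feature 0

weight = np.array([0.6, 0.2, 0.2])            # upweight feature 0

def my_dist(a, b):
    return np.dot((a - b) ** 2, weight)       # weighted squared distance

knn = KNeighborsRegressor(n_neighbors=5, metric=my_dist)
knn.fit(X, y)
pred = knn.predict(X[:3])                     # predictions for the first 3 samples
```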
    

    EDIT: To make things efficient, you can precompute the distance matrix and reuse it in KNN. This should bring a significant speedup by reducing the calls to my_dist, since a non-vectorized custom Python distance function is quite slow. So now:

    dist = np.zeros((len(X),len(X)))  #Computing NXN distance matrix
    for i in range(len(X)):           # You can halve this by using the fact that dist[i,j] = dist[j,i]
        for j in range(len(X)):
            dist[i,j] = my_dist(X[i],X[j])
    
    cv_scores = []
    for k in neighbors:  # neighbors is your list of candidate k values
        print('k: ', k)
        knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed')  # note: metric='precomputed'
        cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc')))  # note: passing dist instead of X
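
    A self-contained sketch of that precomputed-metric loop (the synthetic data, weights and candidate k values are placeholders, and the double loop above is replaced by an equivalent vectorized computation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.random((60, 4))
y = (X[:, 0] > 0.5).astype(int)          # binary target, like shot_made_flag
weight = np.array([0.4, 0.2, 0.2, 0.2])  # hypothetical normalized weights

# N x N weighted squared-distance matrix, vectorized
# (same values my_dist would produce in the double loop)
diff = X[:, None, :] - X[None, :, :]
dist = (diff ** 2) @ weight

cv_scores = []
for k in [3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed')
    cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=3, scoring='roc_auc')))
```

    With metric='precomputed', sklearn treats the input as a distance matrix and slices both its rows and columns during cross-validation, so you pass dist where you would normally pass X.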
    

    I couldn't test it, so let me know if something doesn't work.
