KNN: give more weight to specific features in the distance

借酒劲吻你 2020-12-22 07:17

I'm using the Kobe Bryant Dataset. I wish to predict the shot_made_flag with KnnRegressor.

I\'ve used game_date to extract year and

2 Answers
  • 2020-12-22 07:42

    Just to add to Shihab's answer regarding the distance computation: you can use scipy's pdist, as suggested in this post, which is faster and more efficient.

    from scipy.spatial.distance import pdist, squareform
    
    # create the custom weight array (one weight per feature)
    weight = ...
    # pairwise distances with the weighted Minkowski norm (p=2)
    distances = pdist(X, metric='minkowski', p=2, w=weight)
    # reformat the condensed result as a square matrix
    distances_as_2d_matrix = squareform(distances)
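
    A small end-to-end sketch of the above (the data and weight values here are made up for illustration, not from the question's dataset):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
weight = np.array([4.0, 1.0])   # first feature counts 4x as much

# weighted Minkowski with p=2: sqrt(sum(w * (x - y)**2))
distances = pdist(X, metric='minkowski', p=2, w=weight)
D = squareform(distances)

print(D[0, 1])  # sqrt(4 * 1**2) = 2.0
print(D[0, 2])  # sqrt(1 * 1**2) = 1.0
```

    Note that squareform turns the condensed 1D result of pdist into a symmetric N x N matrix with zeros on the diagonal, which is the shape KNN expects for a precomputed metric.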
    
  • 2020-12-22 07:51

    First, you have to prepare a 1D numpy weight array, specifying a weight for each feature. You could do something like:

    weight = np.ones((M,))          # M is the number of features
    weight[[1, 7, 10]] = 2          # increase the weight of the features at indices 1, 7 and 10
    weight = weight / weight.sum()  # normalize the weights
    

    You can use kobe_data_encoded.columns to find the indices of the season, year and month features in your dataframe, to replace the second line above.

    Now define a distance function, which per sklearn's guidelines has to take two 1D numpy arrays and return a scalar.

    def my_dist(x,y):
        global weight     #1D array, same shape as x or y
        dist = ((x-y)**2) #1D array, same shape as x or y
        return np.dot(dist,weight)  # a scalar float
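
    As a quick sanity check of this weighted distance (the weight values below are made up):

```python
import numpy as np

weight = np.array([0.5, 0.25, 0.25])  # hypothetical normalized weights

def my_dist(x, y):
    dist = (x - y) ** 2          # 1D array, same shape as x or y
    return np.dot(dist, weight)  # scalar: weighted squared distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 1.0])
print(my_dist(x, y))  # 0.5*1 + 0.25*0 + 0.25*4 = 1.5
```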
    

    And initialize KNeighborsRegressor as:

    knn = KNeighborsRegressor(metric=my_dist)
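
    Putting the pieces together on synthetic data (the data, weights and n_neighbors below are placeholders, not the Kobe dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.random((50, 3))                       # 50 samples, 3 features
y = 2.0 * X[:, 0] + rng.normal(0, 0.01, 50)   # target driven mostly by feature 0

weight = np.array([0.6, 0.2, 0.2])            # upweight feature 0

def my_dist(a, b):
    return np.dot((a - b) ** 2, weight)       # weighted squared distance

knn = KNeighborsRegressor(n_neighbors=5, metric=my_dist)
knn.fit(X, y)
pred = knn.predict(X[:3])                     # predictions for the first 3 samples
```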
    

    EDIT: To make things efficient, you can precompute the distance matrix and reuse it in KNN. This should bring a significant speedup by reducing the calls to my_dist, since a non-vectorized custom Python distance function is quite slow. So now:

    dist = np.zeros((len(X),len(X)))  #Computing NXN distance matrix
    for i in range(len(X)):           # You can halve this by using the fact that dist[i,j] = dist[j,i]
        for j in range(len(X)):
            dist[i,j] = my_dist(X[i],X[j])
    
    cv_scores = []
    for k in neighbors:  # neighbors is your list of candidate k values
        print('k: ', k)
        knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed')  # note: metric='precomputed'
        cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc')))  # note: passing dist instead of X
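
    A self-contained sketch of that precomputed-metric loop (the synthetic data, weights and candidate k values are placeholders, and the double loop above is replaced by an equivalent vectorized computation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.random((60, 4))
y = (X[:, 0] > 0.5).astype(int)          # binary target, like shot_made_flag
weight = np.array([0.4, 0.2, 0.2, 0.2])  # hypothetical normalized weights

# N x N weighted squared-distance matrix, vectorized
# (same values my_dist would produce in the double loop)
diff = X[:, None, :] - X[None, :, :]
dist = (diff ** 2) @ weight

cv_scores = []
for k in [3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed')
    cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=3, scoring='roc_auc')))
```

    With metric='precomputed', sklearn treats the input as a distance matrix and slices both its rows and columns during cross-validation, so you pass dist where you would normally pass X.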
    

    I couldn't test it, so let me know if something doesn't work.
