Question
I have a data frame with a mixture of numeric (15 fields) and categorical (5 fields) data.
I can create a complete distance matrix of the numeric fields by following the question "create distance matrix using own calculation pandas".
I want to include the categorical fields as well.
Using this as a template:
import numpy as np
import pandas as pd
import scipy
from scipy.spatial import distance_matrix
from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist
df2=pd.DataFrame({'col1':[1,2,3,4],'col2':[5,6,7,8],'col3':['cat','cat','dog','bird']})
df2
# w is a vector of per-column weights; the post does not show its definition
pd.DataFrame(squareform(pdist(df2.values, lambda u, v: np.sqrt((w*(u-v)**2).sum()))), index=df2.index, columns=df2.index)
In the squareform calculation, I would like to include the test np.where(u[2]==v[2], 0, 10) (and likewise for the other categorical columns).
How do I modify the lambda function to carry out this test as well?
Here, the distance between rows 0 and 1 is
sqrt((2-1)^2 + (6-5)^2 + (cat - cat)^2)
= sqrt(1 + 1 + 0)
and the distance between rows 0 and 2 is
sqrt((3-1)^2 + (7-5)^2 + (dog - cat)^2)
= sqrt(4 + 4 + 100)
and so on for the remaining pairs.
Can anyone suggest how I can implement this algorithm?
Answer 1:
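import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform
df2 = pd.DataFrame({'col1':[1,2,3,4],'col2':[5,6,7,8],'col3':['cat','cat','dog','bird']})
def fun(u, v):
    # Categorical penalty: 0 if col3 matches, 10 otherwise
    const = 0 if u[2] == v[2] else 10
    # Euclidean distance over col1 and col2, plus the squared penalty
    return np.sqrt((u[0]-v[0])**2 + (u[1]-v[1])**2 + const**2)
# pdist calls fun on every pair of rows; squareform expands the condensed result into a full matrix
pd.DataFrame(squareform(pdist(df2.values, fun)), index=df2.index, columns=df2.index)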
Result:
           0          1          2          3
0   0.000000   1.414214  10.392305  10.862780
1   1.414214   0.000000  10.099505  10.392305
2  10.392305  10.099505   0.000000  10.099505
3  10.862780  10.392305  10.099505   0.000000
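The function above hard-codes the column positions, which gets unwieldy with 15 numeric and 5 categorical fields. Below is a minimal sketch of one way to generalize it, not part of the original answer. It assumes every categorical mismatch should add the same penalty of 10, as in the question. The helper name mixed_distance_matrix and its parameters num_cols, cat_cols, and penalty are hypothetical names introduced here for illustration. It swaps the Python callable for two built-in pdist metrics: "sqeuclidean" on the numeric columns and "hamming" on integer-coded categorical columns.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform


def mixed_distance_matrix(df, num_cols, cat_cols, penalty=10.0):
    """Euclidean-style distance where each categorical mismatch adds penalty**2.

    Hypothetical helper: assumes at least one column in each list.
    """
    # Squared Euclidean distance over the numeric columns only.
    num = df[num_cols].to_numpy(dtype=float)
    sq_num = pdist(num, metric="sqeuclidean")

    # Encode each categorical column as integer codes so pdist can compare them.
    codes = np.column_stack([pd.factorize(df[c])[0] for c in cat_cols]).astype(float)
    # Hamming distance is the fraction of mismatching columns; rescale to a count.
    mismatches = pdist(codes, metric="hamming") * len(cat_cols)

    # Combine both parts and expand the condensed vector into a square matrix.
    condensed = np.sqrt(sq_num + penalty**2 * mismatches)
    return pd.DataFrame(squareform(condensed), index=df.index, columns=df.index)


df2 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8], 'col3': ['cat', 'cat', 'dog', 'bird']})
dist = mixed_distance_matrix(df2, ['col1', 'col2'], ['col3'])
print(dist)
```

For df2 this should reproduce the matrix shown above. Because it avoids calling a Python function for every pair of rows, it is likely to scale better on larger frames. Note that pd.factorize codes all missing values as -1, so two missing categories would be treated as equal; handle them beforehand if that is not the intended behavior.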
Source: https://stackoverflow.com/questions/57868339/calculate-distance-matrix-with-mixed-categorical-and-numerics