Distance calculation between rows in Pandas Dataframe using a distance matrix

小鲜肉 2020-12-31 10:04

I have the following Pandas DataFrame:

In [31]:
import pandas as pd
sample = pd.DataFrame({'Sym1': ['a','a','a','d'],'Sym2':['a','c','b','


        
3 Answers
  •  清歌不尽
    2020-12-31 10:37

    For large data, I found a fast way to do this. Assume your data is already a NumPy array named a, with one row per observation.

    from sklearn.metrics.pairwise import euclidean_distances
    dist = euclidean_distances(a, a)
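
If the data starts out in a DataFrame rather than an array, the same call can be applied to its numeric values. The following is only a minimal sketch: the DataFrame `df`, its columns `x` and `y`, and the row labels are made-up illustrations, not the data from the question.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical numeric data; column names and row labels are illustrative only.
df = pd.DataFrame(
    {"x": [0.0, 3.0, 0.0], "y": [0.0, 4.0, 1.0]},
    index=["r0", "r1", "r2"],
)

a = df.to_numpy()                  # rows are observations, columns are features
dist = euclidean_distances(a, a)   # (n_rows, n_rows) array of pairwise distances

# Label the matrix with the original row index so distances can be looked up by name.
dist_df = pd.DataFrame(dist, index=df.index, columns=df.index)
print(dist_df.loc["r0", "r1"])     # distance between (0, 0) and (3, 4)
```

Wrapping the result in a DataFrame is a convenience choice; for very large inputs the plain array avoids the extra index overhead.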
    

    Below is an experiment comparing the time needed by the two approaches (SciPy's pdist with squareform versus scikit-learn's euclidean_distances):

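    import time

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.metrics.pairwise import euclidean_distances

    a = np.random.rand(1000, 1000)

    # Approach 1: SciPy condensed distances, expanded to a square matrix
    time1 = time.time()
    distances = pdist(a, metric='euclidean')
    dist_matrix = squareform(distances)
    time2 = time.time()
    time2 - time1  # 0.3639109134674072

    # Approach 2: scikit-learn pairwise distances
    time1 = time.time()
    dist = euclidean_distances(a, a)
    time2 = time.time()
    time2 - time1  # 0.08735871315002441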
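
Before relying on the faster call, it may be worth checking that both approaches return the same matrix. The sketch below is an assumption-laden variant of the experiment: the fixed seed, the smaller array shape, the tolerance, and the use of `time.perf_counter` are my choices, and timings will vary by machine.

```python
import time

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(0)   # fixed seed (an assumption) for repeatability
a = rng.random((200, 50))        # smaller than the 1000x1000 case to keep this quick

# SciPy: condensed pairwise distances expanded to a square matrix
t0 = time.perf_counter()
dist_scipy = squareform(pdist(a, metric="euclidean"))
t_scipy = time.perf_counter() - t0

# scikit-learn: full pairwise distance matrix in one call
t0 = time.perf_counter()
dist_sklearn = euclidean_distances(a, a)
t_sklearn = time.perf_counter() - t0

# The two results should agree up to floating-point rounding.
same = np.allclose(dist_scipy, dist_sklearn, atol=1e-6)
print(f"scipy: {t_scipy:.4f}s, sklearn: {t_sklearn:.4f}s, same result: {same}")
```

A single run like this is only a rough indication; repeating the measurement (for example with `timeit`) gives a steadier comparison.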
    
