Pairwise Euclidean distance with pandas ignoring NaNs

情到浓时终转凉″ submitted on 2019-12-10 23:19:58

Question


I start with a dictionary, which is the way my data was already formatted:

import pandas as pd
dict2 = {'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0}, 
'C':{'b':1.0,'c':2.0, 'd':4.0}}

I then convert it to a pandas dataframe:

df = pd.DataFrame(dict2)
print(df)
     A    B    C
a  1.0  2.0  NaN
b  2.0  NaN  1.0
c  NaN  2.0  2.0
d  4.0  5.0  4.0

Of course, I can get the difference one at a time by doing this:

df['A'] - df['B']
Out[643]: 
a   -1.0
b    NaN
c    NaN
d   -1.0
dtype: float64

I figured out how to loop through and calculate A-A, A-B, A-C:

for column in df:
    print(df['A'] - df[column])

a    0.0
b    0.0
c    NaN
d    0.0
Name: A, dtype: float64
a   -1.0
b    NaN
c    NaN
d   -1.0
dtype: float64
a    NaN
b    1.0
c    NaN
d    0.0
dtype: float64

What I would like to do is iterate through the columns so as to calculate |A-B|, |A-C|, and |B-C| and store the results in another dictionary.

I want to do this so as to calculate the Euclidean distance between all combinations of columns later on. If there is an easier way to do this I would like to see it as well. Thank you.


Answer 1:


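You can use NumPy broadcasting to compute the Euclidean distance (L2 norm) between every pair of columns in one vectorised step. np.nansum treats NaN as zero, so only the rows where both columns have a value contribute to each distance.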

import numpy as np

i = df.values.T  # one row per original column
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5  # pairwise distances, NaN terms skipped

If you want a DataFrame representing a distance matrix, here's what that would look like:

dist = pd.DataFrame(j, index=df.columns, columns=df.columns)
dist
          A         B    C
A  0.000000  1.414214  1.0
B  1.414214  0.000000  1.0
C  1.000000  1.000000  0.0

dist.loc[x, y] is the distance between columns x and y of the original DataFrame; for example, dist.loc['A', 'B'] is about 1.414.
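To see what the broadcasting step is doing, it can help to compare it with an explicit loop over column pairs. The sketch below is only an illustration: the names frame, cols, diffs, broadcast and looped are not from the answer, and the data is rebuilt from the question so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Same data as in the question (variable names here are illustrative)
frame = pd.DataFrame({'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
                      'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
                      'C': {'b': 1.0, 'c': 2.0, 'd': 4.0}})

cols = frame.to_numpy().T      # shape (3, 4): one row per original column
diffs = cols - cols[:, None]   # shape (3, 3, 4): differences for every pair of columns
broadcast = np.sqrt(np.nansum(diffs ** 2, axis=2))

# Equivalent explicit loop, for comparison
n = len(frame.columns)
looped = np.zeros((n, n))
for x, cx in enumerate(frame.columns):
    for y, cy in enumerate(frame.columns):
        delta = frame[cx] - frame[cy]               # NaN wherever either value is missing
        looped[x, y] = np.sqrt((delta ** 2).sum())  # Series.sum skips NaN by default

print(np.allclose(broadcast, looped))
```

Both versions drop any row where either column is missing, so pairs that share fewer rows are summed over fewer terms. Whether that is acceptable depends on how the distances are used afterwards.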




Answer 2:


The code below iterates through columns to calculate the difference.

# Import libraries
import pandas as pd
import numpy as np

# Create dataframe
df = pd.DataFrame({'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},'C':{'b':1.0,'c':2.0, 'd':4.0}})
df2 = pd.DataFrame()

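# Calculate the absolute difference for each unordered pair of columns
clist = list(df.columns)  # snapshot, so columns added below are not revisited
for i in range(len(clist) - 1):
    for j in range(i + 1, len(clist)):  # start after i so each pair appears only once
        var = clist[i] + '-' + clist[j]
        diff = (df[clist[i]] - df[clist[j]]).abs()
        df[var] = diff   # optional: append to the original dataframe
        df2[var] = diff  # optional: collect in a separate dataframe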

Output in same dataframe

df.head()

Output in a new dataframe

df2.head()
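The question asked for the results to be stored in a dictionary. One possible way to do that, sketched here with itertools.combinations rather than nested index loops, is shown below. The names frame, abs_diffs and distances are illustrative and do not appear in either answer.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Same data as in the question (variable names here are illustrative)
frame = pd.DataFrame({'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
                      'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
                      'C': {'b': 1.0, 'c': 2.0, 'd': 4.0}})

abs_diffs = {}   # (left, right) -> Series of |left - right|, NaN where a value is missing
distances = {}   # (left, right) -> Euclidean distance over the rows both columns share

for left, right in combinations(frame.columns, 2):
    delta = (frame[left] - frame[right]).abs()
    abs_diffs[(left, right)] = delta
    # Series.sum skips NaN by default, so missing rows are ignored
    distances[(left, right)] = float(np.sqrt((delta ** 2).sum()))

print(distances)
```

combinations yields each unordered pair exactly once, so the dictionaries hold |A-B|, |A-C| and |B-C| without duplicates such as B-A.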



Source: https://stackoverflow.com/questions/51352699/pairwise-euclidean-distance-with-pandas-ignoring-nans
