问题
I start with a dictionary, which is the way my data was already formatted:
import pandas as pd
dict2 = {'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},
'C':{'b':1.0,'c':2.0, 'd':4.0}}
I then convert it to a pandas dataframe:
df = pd.DataFrame(dict2)
print(df)
A B C
a 1.0 2.0 NaN
b 2.0 NaN 1.0
c NaN 2.0 2.0
d 4.0 5.0 4.0
Of course, I can get the difference one at a time by doing this:
df['A'] - df['B']
Out[643]:
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
I figured out how to loop through and calculate A-A, A-B, A-C:
for column in df:
print(df['A'] - df[column])
a 0.0
b 0.0
c NaN
d 0.0
Name: A, dtype: float64
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
a NaN
b 1.0
c NaN
d 0.0
dtype: float64
What I would like to do is iterate through the columns so as to calculate |A-B|, |A-C|, and |B-C| and store the results in another dictionary.
I want to do this so as to calculate the Euclidean distance between all combinations of columns later on. If there is an easier way to do this I would like to see it as well. Thank you.
回答1:
You can use numpy broadcasting to compute vectorised Euclidean distance (L2-norm), ignoring NaNs using np.nansum
.
i = df.values.T
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5
If you want a DataFrame representing a distance matrix, here's what that would look like:
df = (lambda v, c: pd.DataFrame(v, c, c))(j, df.columns)
df
A B C
A 0.000000 1.414214 1.0
B 1.414214 0.000000 1.0
C 1.000000 1.000000 0.0
df[i, j]
represents the distance between the ith and jth column in the original DataFrame.
回答2:
The code
below iterates through columns to calculate the difference.
# Import libraries
import pandas as pd
import numpy as np
# Create dataframe
df = pd.DataFrame({'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},'C':{'b':1.0,'c':2.0, 'd':4.0}})
df2 = pd.DataFrame()
# Calculate difference
clist = df.columns
for i in range (0,len(clist)-1):
for j in range (1,len(clist)):
if (clist[i] != clist[j]):
var = clist[i] + '-' + clist[j]
df[var] = abs(df[clist[i]] - df[clist[j]]) # optional
df2[var] = abs(df[clist[i]] - df[clist[j]]) # optional
Output in same dataframe
df.head()
Output in a new dataframe
df2.head()
来源:https://stackoverflow.com/questions/51352699/pairwise-euclidean-distance-with-pandas-ignoring-nans