Question
I have a function f(x, y) that takes two Pandas Series and returns a floating point number. I would like to apply f to each pair of columns in a DataFrame D and construct another DataFrame E of the returned values, so that f(D[i], D[j]) is the value in the i-th row and j-th column. The straightforward solution is to run a nested loop over all pairs of columns:
E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)
But is there a more elegant solution that perhaps would not involve explicit loops?
NB This question is not a dupe of this, despite the similar names.
EDIT A toy example:
D = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=("a","b","c"))
def f(x,y): return x.dot(y)
E
# a b c
#a 66 78 90
#b 78 93 108
#c 90 108 126
Answer 1:
You can avoid explicit loops by using NumPy's broadcasting. Combined with np.vectorize() and an explicit signature, that gives us the following:
vf = np.vectorize(f, signature='(n),(n)->()')
result = vf(D.T.values, D.T.values[:, None])
Notes:
- You can add a print statement (e.g. print(f'x:\n{x}\ny:\n{y}\n')) to your function to convince yourself it is doing the right thing.
- Your function f() is symmetric. If it were not (e.g. def f(x, y): return np.linalg.norm(x - y**2)), it would matter which argument is extended with an extra dimension for broadcasting. With the expression above, you get the same result as your E. If you instead use result = vf(D.T.values[:, None], D.T.values), you get its transpose.
- The result is a NumPy array, of course. If you want it back as a DataFrame, add:
df = pd.DataFrame(result, index=D.columns, columns=D.columns)
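To make the note about argument order concrete, here is a minimal self-contained sketch. It reuses the toy D from the question and the asymmetric f() mentioned above; the names cols, flipped, same_as_E and is_transpose are only illustrative. Keep in mind that np.vectorize() with a signature hands f() plain 1-D NumPy arrays rather than Series, so this assumes f() works on both.

```python
import numpy as np
import pandas as pd

# Toy data from the question.
D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))

def f(x, y):
    # Deliberately asymmetric, so the argument order shows up in the result.
    return np.linalg.norm(x - y**2)

# Reference: the nested loop from the question (the outer loop over j builds rows).
E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

# With a signature, f is called on 1-D NumPy arrays, not on Series.
vf = np.vectorize(f, signature="(n),(n)->()")
cols = D.T.values                   # shape (3, 3): one row per column of D

result = vf(cols, cols[:, None])    # extra axis on the second argument
flipped = vf(cols[:, None], cols)   # extra axis on the first argument

same_as_E = np.allclose(result, E.values)
is_transpose = np.allclose(flipped, E.values.T)
print(same_as_E, is_transpose)
```

If the reasoning in the note holds, both checks should come out true, and result and flipped should differ from each other because f() is not symmetric.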
BTW, if f() is really the one from your toy example, as I'm sure you already know, you can directly write:
df = D.T.dot(D)
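As a quick sanity check, the following sketch compares D.T.dot(D) with the nested loop on the toy data from the question. The names loop_E, gram and matches are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data and toy function from the question.
D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))

def f(x, y):
    return x.dot(y)

# Nested loop from the question.
loop_E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                      columns=D.columns, index=D.columns)

# Single matrix product: entry (i, j) is the dot product of columns i and j.
gram = D.T.dot(D)

# Compare values and labels separately to avoid dtype subtleties.
matches = (np.array_equal(gram.values, loop_E.values)
           and list(gram.index) == list(loop_E.index)
           and list(gram.columns) == list(loop_E.columns))
print(gram)
print(matches)
```

Unlike the vectorized version, the matrix product already comes back as a labeled DataFrame, so no extra wrapping step is needed.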
Performance:
The speed-up from broadcasting with np.vectorize() relative to the nested loop is roughly 10x, and it is stable over various matrix sizes. By contrast, D.T.dot(D) is more than 700x faster for size (100, 100), and, critically, the relative speed-up seems to grow with larger sizes (up to 12,000x faster in my tests for size (200, 1000), where the loop makes 1M calls to f()). So, as usual, there is a strong incentive to find a way to implement your function f() using existing NumPy functions!
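If you want to reproduce this kind of comparison yourself, a rough timing sketch could look like the one below. The matrix size (50, 30), the random data, and the helper names nested_loop, broadcast and matmul are assumptions made for illustration, not the setup used for the numbers above, and the absolute timings will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test matrix: 50 rows, 30 columns of random floats.
rng = np.random.default_rng(0)
D = pd.DataFrame(rng.random((50, 30)))

def f(x, y):
    return x.dot(y)

def nested_loop():
    # Baseline from the question: one Python-level call to f per column pair.
    return pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                        columns=D.columns, index=D.columns)

vf = np.vectorize(f, signature="(n),(n)->()")

def broadcast():
    # np.vectorize plus broadcasting, operating on plain NumPy arrays.
    cols = D.T.values
    return vf(cols, cols[:, None])

def matmul():
    # Dedicated matrix product, only valid because f is a dot product here.
    return D.T.dot(D)

# Best of three single runs per approach.
timings = {fn.__name__: min(timeit.repeat(fn, number=1, repeat=3))
           for fn in (nested_loop, broadcast, matmul)}
for name, seconds in timings.items():
    print(f"{name:12s} {seconds:.6f} s")
```

All three functions compute the same matrix here, so any difference in the printed times comes from how the pairwise calls are dispatched rather than from the arithmetic itself.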
Source: https://stackoverflow.com/questions/46313624/pandas-apply-function-to-each-pair-of-columns