Pandas: Apply function to each pair of columns

Submitted by 前提是你 on 2021-02-16 16:12:04

Question


I have a function f(x, y) that takes two Pandas Series and returns a floating-point number. I would like to apply f to each pair of columns in a DataFrame D and construct another DataFrame E of the returned values, so that f(D[i], D[j]) is the value in the ith row and jth column. The straightforward solution is to run a nested loop over all pairs of columns:

E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

But is there a more elegant solution that perhaps would not involve explicit loops?

NB: This question is not a duplicate of this one, despite the similar names.

EDIT A toy example:

D = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=("a","b","c"))
def f(x,y): return x.dot(y)

E
#    a    b    c
#a  66   78   90
#b  78   93  108
#c  90  108  126
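
For reference, the pieces above can be put together into one runnable script. This is a minimal sketch: the helper name pairwise_apply is hypothetical and only wraps the nested loop from the question, so it keeps the same orientation as that loop (row j, column i holds f(D[i], D[j])).

```python
import pandas as pd


def pairwise_apply(D, f):
    """Apply f to every pair of columns of D (hypothetical helper name)."""
    # Same orientation as the question's loop:
    # row j, column i holds f(D[i], D[j]).
    return pd.DataFrame(
        [[f(D[i], D[j]) for i in D] for j in D],
        columns=D.columns,
        index=D.columns,
    )


D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))


def f(x, y):
    # Toy function from the question: dot product of two columns.
    return x.dot(y)


E = pairwise_apply(D, f)
print(E)
```

Because the toy f is symmetric, E is symmetric as well, so the row/column orientation does not show up in this particular output.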

Answer 1:


You can avoid explicit loops by using NumPy's broadcasting.

Combined with np.vectorize() and an explicit signature, that gives us the following:

vf = np.vectorize(f, signature='(n),(n)->()')
result = vf(D.T.values, D.T.values[:, None])

Notes:

  1. You can add a print statement (e.g. print(f'x:\n{x}\ny:\n{y}\n')) to your function to convince yourself it is doing the right thing.
  2. Your function f() is symmetric. If it is not (e.g. def f(x, y): return np.linalg.norm(x - y**2)), it matters which argument gets the extra dimension for broadcasting. With the expression above, you'll get the same result as your E. If instead you use result = vf(D.T.values[:, None], D.T.values), you'll get its transpose.
  3. The result is a NumPy array, of course; if you want it back as a DataFrame, add:
df = pd.DataFrame(result, index=D.columns, columns=D.columns)
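
To make note 2 concrete, here is a small sketch that compares both broadcasting orders against the nested loop from the question. The function g is the asymmetric example from the note, renamed here for illustration. The np.asarray calls are an addition so that g accepts both Series (in the loop) and the plain NumPy arrays that np.vectorize passes in.

```python
import numpy as np
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))


def g(x, y):
    # Asymmetric example: g(x, y) != g(y, x) in general.
    return np.linalg.norm(np.asarray(x) - np.asarray(y) ** 2)


# Reference: the question's nested loop (row j, column i holds g(D[i], D[j])).
E = pd.DataFrame([[g(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

vg = np.vectorize(g, signature='(n),(n)->()')
cols = D.T.values  # one row per column of D

same = vg(cols, cols[:, None])     # extra axis on the second argument
flipped = vg(cols[:, None], cols)  # extra axis on the first argument

print(np.allclose(same, E.values))       # matches the loop's orientation
print(np.allclose(flipped, E.values.T))  # matches its transpose
```

Both checks should print True, and same differs from flipped because g is not symmetric.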

BTW, if f() is really the one from your toy example, as I'm sure you already know, you can directly write:

df = D.T.dot(D)
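
As a quick sanity check (a sketch using the toy data from the question), the matrix product reproduces the same table as the nested loop, with the column labels carried over to both axes:

```python
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))

# Nested-loop version with the toy dot-product function.
E = pd.DataFrame([[D[i].dot(D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

# Matrix-product version; D.T @ D is an equivalent spelling.
gram = D.T.dot(D)

print(gram)
print(gram.equals(E))
```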

Performance:

Performance-wise, the speed-up from broadcasting and vectorize over the nested loop is roughly 10x (stable across various matrix sizes). By contrast, D.T.dot(D) is more than 700x faster for size (100, 100), and, critically, the relative speed-up seems to grow with larger sizes (up to 12,000x faster in my tests for size (200, 1000), which results in 1M loop iterations). So, as usual, there is a strong incentive to try to implement your function f() using existing NumPy function(s)!
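
A comparison like this can be reproduced with a short timing script. The sketch below is not the benchmark used for the figures above: the data size, random seed, and repeat count are arbitrary assumptions, and the measured ratios will vary with machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Assumed size, kept small so the nested loop finishes quickly.
D = pd.DataFrame(rng.standard_normal((100, 30)))


def f(x, y):
    return x.dot(y)


def nested_loop():
    return pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                        columns=D.columns, index=D.columns)


vf = np.vectorize(f, signature='(n),(n)->()')


def broadcast():
    cols = D.T.values
    return vf(cols, cols[:, None])


def matmul():
    return D.T.dot(D)


timings = {}
for name, fn in [("nested loop", nested_loop),
                 ("vectorize", broadcast),
                 ("D.T.dot(D)", matmul)]:
    # Best of three single runs, in seconds.
    timings[name] = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{name:12s} {timings[name]:.6f} s")
```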



Source: https://stackoverflow.com/questions/46313624/pandas-apply-function-to-each-pair-of-columns
