Vectorizing for loop with repeated indices in python

问题

I am trying to optimize a snippet that gets called a lot (millions of times) so any type of speed improvement (hopefully removing the for-loop) would be great.

I am computing a correlation function of some j'th particle with all others

C_j(|r-r'|) = sqrt(E((s_j(r')-s_k(r))^2)) averaged over k.

My idea is to have a variable corrfun which bins data into some bins (the r, defined elsewhere). I find what bin of r each s_k belongs to and this is stored in ind. So ind[0] is the index of r (and thus the corrfun) for which the j=0 point corresponds to. Multiple points can fall into the same bin (in fact I want bins to be big enough to contain multiple points) so I sum together all of the (s_j(r')-s_k(r))^2 and then divide by number of points in that bin (stored in variable rw). The code I ended up making for this is the following (np is for numpy):

for k, v in enumerate(ind):
        if j==k:
            continue
        corrfun[v] += (s[k]-s[j])**2
        rw[v] += 1
rw2 = rw
rw2[rw < 1] = 1
corrfun = np.sqrt(np.divide(corrfun, rw2))

Note, the rw2 business was because I want to avoid divide by 0 problems but I do return the rw array and I want to be able to differentiate between the rw=0 and rw=1 elements. Perhaps there is a more elegant solution for this as well.

Is there a way to make the for-loop faster? While I would like to not add the self interaction (j==k) I am even ok with having self interaction if it means I can get significantly faster calculation (length of ind ~ 1E6 so self interaction is probably insignificant anyways).

Thank you!

Ilya

Edit:

Here is the full code. Note, in the full code I am averaging over j as well.

import numpy as np

def twopointcorr(x,y,s,dr):

    width = np.max(x)-np.min(x)
    height = np.max(y)-np.min(y)

    n = len(x)

    maxR = np.sqrt((width/2)**2 + (height/2)**2)

    r = np.arange(0, maxR, dr)
    print(r)
    corrfun = r*0
    rw = r*0
    print(maxR)
    ''' go through all points'''
    for j in range(0, n-1):
        hypot = np.sqrt((x[j]-x)**2+(y[j]-y)**2)
        ind = [np.abs(r-h).argmin() for h in hypot]

        for k, v in enumerate(ind):
            if j==k:
                continue
            corrfun[v] += (s[k]-s[j])**2
            rw[v] += 1

    rw2 = rw
    rw2[rw < 1] = 1
    corrfun = np.sqrt(np.divide(corrfun, rw2))
    return r, corrfun, rw

I debug test it the following way

from twopointcorr import twopointcorr
import numpy as np
import matplotlib.pyplot as plt
import time

n=1000
x = np.random.rand(n)
y = np.random.rand(n)
s = np.random.rand(n)

print('running two point corr functinon')

start_time = time.time()
r,corrfun,rw = twopointcorr(x,y,s,0.1)
print("--- Execution time is %s seconds ---" % (time.time() - start_time))

fig1=plt.figure()
plt.plot(r, corrfun,'-x')

fig2=plt.figure()
plt.plot(r, rw,'-x')
plt.show()

Again, the main issue is that in the real dataset n~1E6. I can resample to make it smaller, of course, but I would love to actually crank through the dataset.

回答1:

Here is the code that use broadcast, hypot, round, bincount to remove all the loops:

def twopointcorr2(x, y, s, dr):
    width = np.max(x)-np.min(x)
    height = np.max(y)-np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)    
    osub = lambda x:np.subtract.outer(x, x)

    ind = np.clip(np.round(np.hypot(osub(x), osub(y)) / dr), 0, len(r)-1).astype(int)
    rw = np.bincount(ind.ravel())
    rw[0] -= len(x)
    corrfun = np.bincount(ind.ravel(), (osub(s)**2).ravel())
    return r, corrfun, rw

to compare, I modified your code as follows:

def twopointcorr(x,y,s,dr):
    width = np.max(x)-np.min(x)
    height = np.max(y)-np.min(y)

    n = len(x)

    maxR = np.sqrt((width/2)**2 + (height/2)**2)

    r = np.arange(0, maxR, dr)
    corrfun = r*0
    rw = r*0
    for j in range(0, n):
        hypot = np.sqrt((x[j]-x)**2+(y[j]-y)**2)
        ind = [np.abs(r-h).argmin() for h in hypot]
        for k, v in enumerate(ind):
            if j==k:
                continue
            corrfun[v] += (s[k]-s[j])**2
            rw[v] += 1

    return r, corrfun, rw

and here is the code to check the results:

import numpy as np

n=1000
x = np.random.rand(n)
y = np.random.rand(n)
s = np.random.rand(n)

r1, corrfun1, rw1 = twopointcorr(x,y,s,0.1)
r2, corrfun2, rw2 = twopointcorr2(x,y,s,0.1)

assert np.allclose(r1, r2)
assert np.allclose(corrfun1, corrfun2)
assert np.allclose(rw1, rw2)

and the %timeit results:

%timeit twopointcorr(x,y,s,0.1)
%timeit twopointcorr2(x,y,s,0.1)

outputs:

1 loop, best of 3: 5.16 s per loop
10 loops, best of 3: 134 ms per loop

回答2:

Your original code on my system runs in about 5.7 seconds. I fully vectorized the inner loop and got it to run in 0.39 seconds. Simply replace your "go through all points" loop with this:

    points = np.column_stack((x,y))
    hypots = scipy.spatial.distance.cdist(points, points)
    inds = np.rint(hypots.clip(max=maxR) / dr).astype(np.int)

    # go through all points            
    for j in range(n): # n.b. previously n-1, not sure why
        ind = inds[j]

        np.add.at(corrfun, ind, (s - s[j])**2)

        np.add.at(rw, ind, 1)
        rw[ind[j]] -= 1 # subtract self

The first observation was that your hypot code was computing 2D distances, so I replaced that with cdist from SciPy to do it all in a single call. The second was that the inner for loop was slow, and thanks to an insightful comment from @hpaulj I vectorized that as well using np.add.at().

Since you asked how to vectorize the inner loop as well, I did that later. It now takes 0.25 seconds to run, for a total speedup of over 20x. Here's the final code:

    points = np.column_stack((x,y))
    hypots = scipy.spatial.distance.cdist(points, points)
    inds = np.rint(hypots.clip(max=maxR) / dr).astype(np.int)

    sn = np.tile(s, (n,1)) # n copies of s                                                                              
    diffs = (sn - sn.T)**2 # squares of pairwise differences
    np.add.at(corrfun, inds, diffs)

    rw = np.bincount(inds.flatten(), minlength=len(r))
    np.subtract.at(rw, inds.diagonal(), 1) # subtract self

This uses more memory but does produce a substantial speedup vs. the single-loop version above.

回答3:

Ok, so as it turns out outer products are incredibly memory expensive, however, using answers from @HYRY and @JohnZwinck i was able to make code that is still roughly linear in n in memory and computes fast (0.5 seconds for the test case)

import numpy as np

def twopointcorr(x,y,s,dr,maxR=-1):

    width = np.max(x)-np.min(x)
    height = np.max(y)-np.min(y)

    n = len(x)

    if maxR < dr:
        maxR = np.sqrt((width/2)**2 + (height/2)**2)

    r = np.arange(0, maxR+dr, dr)

    corrfun = r*0
    rw = r*0


    for j in range(0, n):

        ind = np.clip(np.round(np.hypot(x[j]-x,y[j]-y) / dr), 0, len(r)-1).astype(int)
        np.add.at(corrfun, ind, (s - s[j])**2)
        np.add.at(rw, ind, 1)

    rw[0] -= n

    corrfun = np.sqrt(np.divide(corrfun, np.maximum(rw,1)))
    r=np.delete(r,-1)
    rw=np.delete(rw,-1)
    corrfun=np.delete(corrfun,-1)
    return r, corrfun, rw

来源：https://stackoverflow.com/questions/36853770/vectorizing-for-loop-with-repeated-indices-in-python

标签

python

for-loop

numpy

enumerate