Optimize Python: Large arrays, memory problems

Question

I'm having a speed problem running some Python / NumPy code. I don't know how to make it faster; maybe someone else does?

Assume there is a surface with two triangulations, one fine (..._fine) with M points and one coarse with N points. There is also data on the coarse mesh at every point (N floats). I'm trying to do the following:

For every point on the fine mesh, find the k closest points on the coarse mesh and take the mean of their values. In short: interpolate data from coarse to fine.

My code currently looks like this. With large data (in my case M = 2e6, N = 1e4) it runs for about 25 minutes, I guess because the explicit for loop does not go into NumPy. Any ideas how to solve this with smart indexing? M x N arrays blow up the RAM.

import numpy as np

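# p_fine.shape == (m, 3)    points of the fine mesh
# p.shape == (n, 3)         points of the coarse mesh
# data_coarse.shape == (n,) one value per coarse point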

data_fine = np.empty((m,))
for i, ps in enumerate(p_fine):
    data_fine[i] = np.mean(data_coarse[np.argsort(np.linalg.norm(ps-p,axis=1))[:k]])
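
For a sense of scale, the size of the full M x N distance matrix can be worked out with some quick arithmetic. This is a rough sketch that assumes float64 entries (8 bytes each, NumPy's default float type); the variable names are illustrative only.

```python
# Rough size of a dense M x N distance matrix.
# Assumption: float64 entries (8 bytes each), NumPy's default float type.
M = 2_000_000   # fine-mesh points, from the question
N = 10_000      # coarse-mesh points, from the question
bytes_per_float = 8

total_bytes = M * N * bytes_per_float
total_gb = total_bytes / 1e9
print(f"{total_gb:.0f} GB")  # prints "160 GB"
```

So a fully vectorized one-shot solution is out of reach on ordinary hardware, which is why working in pieces is the natural direction.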

Cheers!

Answer 1:

First of all, thanks for the detailed help.

Divakar, your solutions gave a substantial speed-up. With my data, the code ran in just under 2 minutes, depending a bit on the chunk size.

I also tried sklearn and ended up with

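def sklearnSearch_v3(p, p_fine, k):
    # Build the neighbor search structure on the coarse points
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(p)
    # kneighbors returns (distances, indices); average the data over the k indices
    return data_coarse[neigh.kneighbors(p_fine)[1]].mean(axis=1)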

which turned out to be quite fast. With my data sizes, the setup

import numpy as np
from sklearn.neighbors import NearestNeighbors

m,n = 2000000,20000
p_fine = np.random.rand(m,3)
p = np.random.rand(n,3)
data_coarse = np.random.rand(n)
k = 3

yields

%timeit sklearnSearch_v3(p, p_fine, k)
1 loop, best of 3: 7.46 s per loop
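
The last line of sklearnSearch_v3 relies on NumPy integer-array indexing: indexing data_coarse with an (m, k) array of neighbor indices gives an (m, k) array of neighbor values, which is then averaged along axis 1. Here is a minimal sketch of that step, with a hand-written index array standing in for the kneighbors output (all values are made up for illustration):

```python
import numpy as np

# Made-up coarse data (n = 4), for illustration only.
data_coarse = np.array([10.0, 20.0, 30.0, 40.0])

# Hand-written stand-in for neigh.kneighbors(p_fine)[1]:
# one row per fine point, k = 2 neighbor indices per row.
indices = np.array([[0, 1],
                    [2, 3],
                    [0, 3]])

neighbor_values = data_coarse[indices]      # shape (3, 2)
data_fine = neighbor_values.mean(axis=1)    # shape (3,)
print(data_fine)  # [15. 35. 25.]
```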

Answer 2:

Approach #1

We are working with large datasets and memory is an issue, so I will try to optimize the computations within the loop. We can use np.einsum to replace the np.linalg.norm part and np.argpartition in place of a full sort with np.argsort. Squared distances are enough here, because taking the square root does not change which points are closest. Like so -

out = np.empty((m,))
for i, ps in enumerate(p_fine):
    subs = ps-p
    sq_dists = np.einsum('ij,ij->i',subs,subs)
    out[i] = data_coarse[np.argpartition(sq_dists,k)[:k]].sum()
out = out/k
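
As a sanity check on the two substitutions, the sketch below compares them with the original calls for a single fine point. It assumes small random inputs (the seed and sizes are arbitrary choices, not from the answer). Note that np.argpartition returns the k smallest entries in no particular order, which is fine here because only their mean is needed.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, illustration only
p = rng.random((50, 3))          # coarse points
ps = rng.random(3)               # one fine point
k = 5

subs = ps - p
# Squared distances via einsum: row-wise dot product of subs with itself.
sq_dists = np.einsum('ij,ij->i', subs, subs)
dists = np.linalg.norm(subs, axis=1)
same_dists = np.allclose(sq_dists, dists**2)

# argpartition places the k smallest entries first, in arbitrary order.
idx_part = np.argpartition(sq_dists, k)[:k]
idx_sort = np.argsort(dists)[:k]
same_neighbors = set(idx_part.tolist()) == set(idx_sort.tolist())
print(same_dists, same_neighbors)
```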

Approach #2

As another approach, we can also use SciPy's cdist for a fully vectorized solution, like so -

from scipy.spatial.distance import cdist
out = data_coarse[np.argpartition(cdist(p_fine,p),k,axis=1)[:,:k]].mean(1)

But since we are memory bound here, we can perform these operations in chunks. Basically, we take chunks of rows from the tall array p_fine, which has millions of rows, run cdist on each chunk, and thus get a chunk of output elements at each iteration instead of just one scalar. This cuts the loop count by a factor equal to the chunk length.

So, finally we would have an implementation like so -

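out = np.empty((m,))
L = 10 # Length of chunk (to be used as a param)
# Step through p_fine in chunks of L rows; the last chunk may be shorter,
# so rows are not skipped when m is not a multiple of L
for start in range(0, m, L):
    p_fine_slice = p_fine[start:start+L]
    out[start:start+L] = data_coarse[np.argpartition(cdist\
                           (p_fine_slice,p),k,axis=1)[:,:k]].mean(1)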
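
One way to pick a starting value for L is from a memory budget, since for each chunk cdist allocates an L x n float64 matrix. The helper below is a hypothetical sketch: the name chunk_length and the 80 MB budget are assumptions made for illustration, not part of the approach above. The fastest setting still has to be found by timing.

```python
def chunk_length(n, budget_bytes, bytes_per_float=8):
    """Largest chunk length L such that an L x n float64 matrix fits the budget.

    Hypothetical helper: it only counts the distance matrix itself,
    not the temporaries that argpartition and indexing create.
    """
    return max(1, budget_bytes // (n * bytes_per_float))

# Example: n = 10_000 coarse points, 80 MB budget for the distance matrix.
L = chunk_length(10_000, 80_000_000)
print(L)  # 1000
```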

Runtime test

Setup -

# Setup inputs
import numpy as np
from scipy.spatial.distance import cdist

m,n = 20000,100
p_fine = np.random.rand(m,3)
p = np.random.rand(n,3)
data_coarse = np.random.rand(n)
k = 5

def original_approach(p,p_fine,m,n,k):
    data_fine = np.empty((m,))
    for i, ps in enumerate(p_fine):
        data_fine[i] = np.mean(data_coarse[np.argsort(np.linalg.norm\
                                                 (ps-p,axis=1))[:k]])
    return data_fine

def proposed_approach(p,p_fine,m,n,k):    
    out = np.empty((m,))
    for i, ps in enumerate(p_fine):
        subs = ps-p
        sq_dists = np.einsum('ij,ij->i',subs,subs)
        out[i] = data_coarse[np.argpartition(sq_dists,k)[:k]].sum()
    return out/k

def proposed_approach_v2(p,p_fine,m,n,k,len_per_iter):
    L = len_per_iter
    out = np.empty((m,))
    # range(0, m, L) also covers a final partial chunk when m % L != 0
    for start in range(0, m, L):
        p_fine_slice = p_fine[start:start+L]
        out[start:start+L] = data_coarse[np.argpartition(cdist\
                               (p_fine_slice,p),k,axis=1)[:,:k]].sum(1)
    return out/k

Timings -

In [134]: %timeit original_approach(p,p_fine,m,n,k)
1 loops, best of 3: 1.1 s per loop

In [135]: %timeit proposed_approach(p,p_fine,m,n,k)
1 loops, best of 3: 539 ms per loop

In [136]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=100)
10 loops, best of 3: 63.2 ms per loop

In [137]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=1000)
10 loops, best of 3: 53.1 ms per loop

In [138]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=2000)
10 loops, best of 3: 63.8 ms per loop

So, there's about a 2x improvement with the first proposed approach, and about 20x over the original approach with the second one at the sweet spot, with the len_per_iter param set to 1000. Hopefully this will bring your 25-minute runtime down to a little over a minute. Not bad, I guess!
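
Before relying on the timings, it is worth checking that a chunked variant returns the same values as the original loop. The sketch below does this on small random inputs. Two things in it are assumptions made for illustration: the seed and sizes are arbitrary, and squared distances computed by NumPy broadcasting stand in for cdist so that the check needs only NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed and sizes
m, n, k = 230, 40, 5                    # m deliberately not a multiple of L
p_fine = rng.random((m, 3))
p = rng.random((n, 3))
data_coarse = rng.random(n)

# Reference: the original per-point loop.
ref = np.empty(m)
for i, ps in enumerate(p_fine):
    ref[i] = data_coarse[np.argsort(np.linalg.norm(ps - p, axis=1))[:k]].mean()

# Chunked variant; broadcasting stands in for cdist (squared distances).
L = 100
out = np.empty(m)
for start in range(0, m, L):
    chunk = p_fine[start:start + L]
    diff = chunk[:, None, :] - p[None, :, :]        # shape (chunk_len, n, 3)
    sq = np.einsum('ijk,ijk->ij', diff, diff)       # shape (chunk_len, n)
    idx = np.argpartition(sq, k, axis=1)[:, :k]     # k nearest, unordered
    out[start:start + L] = data_coarse[idx].mean(axis=1)

results_match = np.allclose(ref, out)
print(results_match)
```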

Source: https://stackoverflow.com/questions/39749807/optimize-python-large-arrays-memory-problems

Tags: python, arrays, performance, numpy, large-data