问题
I'm having a speed problem running a python / numypy code. I don't know how to make it faster, maybe someone else?
Assume there is a surface with two triangulation, one fine (..._fine) with M points, one coarse with N points. Also, there's data on the coarse mesh at every point (N floats). I'm trying to do the following:
For every point on the fine mesh, find the k closest points on coarse mesh and get mean value. Short: interpolate data from coarse to fine.
My code right now goes like that. With large data (in my case M = 2e6, N = 1e4) the code runs about 25 minutes, guess due to the explicit for loop not going into numpy. Any ideas how to solve that one with smart indexing? M x N arrays blowing the RAM..
import numpy as np
p_fine.shape => m x 3
p.shape => n x 3
data_fine = np.empty((m,))
for i, ps in enumerate(p_fine):
data_fine[i] = np.mean(data_coarse[np.argsort(np.linalg.norm(ps-p,axis=1))[:k]])
Cheers!
回答1:
First of all thanks for the detailed help.
First, Divakar, your solutions gave substantial speed-up. With my data, the code ran for just below 2 minutes depending a bit on the chunk size.
I also tried my way around sklearn and ended up with
def sklearnSearch_v3(p, p_fine, k):
neigh = NearestNeighbors(k)
neigh.fit(p)
return data_coarse[neigh.kneighbors(p_fine)[1]].mean(axis=1)
which ended up being quite fast, for my data sizes, I get the following
import numpy as np
from sklearn.neighbors import NearestNeighbors
m,n = 2000000,20000
p_fine = np.random.rand(m,3)
p = np.random.rand(n,3)
data_coarse = np.random.rand(n)
k = 3
yields
%timeit sklearv3(p, p_fine, k)
1 loop, best of 3: 7.46 s per loop
回答2:
Approach #1
We are working with large sized datasets and memory is an issue, so I will try to optimize the computations within the loop. Now, we can use np.einsum to replace np.linalg.norm
part and np.argpartition in place of actual sorting with np.argsort
, like so -
out = np.empty((m,))
for i, ps in enumerate(p_fine):
subs = ps-p
sq_dists = np.einsum('ij,ij->i',subs,subs)
out[i] = data_coarse[np.argpartition(sq_dists,k)[:k]].sum()
out = out/k
Approach #2
Now, as another approach we can also use Scipy's cdist for a fully vectorized solution, like so -
from scipy.spatial.distance import cdist
out = data_coarse[np.argpartition(cdist(p_fine,p),k,axis=1)[:,:k]].mean(1)
But, since we are memory bound here, we can perform these operations in chunks. Basically, we would get chunks of rows from that tall array p_fine
that has millions of rows and use cdist
and thus at each iteration get chunks of output elements instead of just one scalar. With this, we would cut the loop count by the length of that chunk.
So, finally we would have an implementation like so -
out = np.empty((m,))
L = 10 # Length of chunk (to be used as a param)
num_iter = m//L
for j in range(num_iter):
p_fine_slice = p_fine[L*j:L*j+L]
out[L*j:L*j+L] = data_coarse[np.argpartition(cdist\
(p_fine_slice,p),k,axis=1)[:,:k]].mean(1)
Runtime test
Setup -
# Setup inputs
m,n = 20000,100
p_fine = np.random.rand(m,3)
p = np.random.rand(n,3)
data_coarse = np.random.rand(n)
k = 5
def original_approach(p,p_fine,m,n,k):
data_fine = np.empty((m,))
for i, ps in enumerate(p_fine):
data_fine[i] = np.mean(data_coarse[np.argsort(np.linalg.norm\
(ps-p,axis=1))[:k]])
return data_fine
def proposed_approach(p,p_fine,m,n,k):
out = np.empty((m,))
for i, ps in enumerate(p_fine):
subs = ps-p
sq_dists = np.einsum('ij,ij->i',subs,subs)
out[i] = data_coarse[np.argpartition(sq_dists,k)[:k]].sum()
return out/k
def proposed_approach_v2(p,p_fine,m,n,k,len_per_iter):
L = len_per_iter
out = np.empty((m,))
num_iter = m//L
for j in range(num_iter):
p_fine_slice = p_fine[L*j:L*j+L]
out[L*j:L*j+L] = data_coarse[np.argpartition(cdist\
(p_fine_slice,p),k,axis=1)[:,:k]].sum(1)
return out/k
Timings -
In [134]: %timeit original_approach(p,p_fine,m,n,k)
1 loops, best of 3: 1.1 s per loop
In [135]: %timeit proposed_approach(p,p_fine,m,n,k)
1 loops, best of 3: 539 ms per loop
In [136]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=100)
10 loops, best of 3: 63.2 ms per loop
In [137]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=1000)
10 loops, best of 3: 53.1 ms per loop
In [138]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=2000)
10 loops, best of 3: 63.8 ms per loop
So, there's about 2x
improvement with the first proposed approach and 20x
over the original approach with the second one at the sweet spot with the len_per_iter
param set at 1000
. Hopefully this will bring down your 25 minutes runtime to little over a minute. Not bad I guess!
来源:https://stackoverflow.com/questions/39749807/optimize-python-large-arrays-memory-problems