I recently switched from Matlab
to Python
. While converting one of my lengthy codes, I was surprised to find Python
being very slow. I
You want to get rid of those for
loops. Try this:
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
i, j = np.indices((N, M))
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
You can also do it with broadcasting, which may be even faster, but a little less intuitive coming from MATLAB
.
I got ~5x speed improvement over the meshgrid solution using only broadcasting:
def exampleKernelD(M, x, N, y):
return np.sqrt((x[:,1:] - y[:,1:].T) ** 2 + (x[:,:1] - y[:,:1].T) ** 2)
It has been mentioned that Matlab uses an internal Jit-compiler to get good performance on such tasks. Let's compare Matlabs jit-compiler with a Python jit-compiler (Numba).
Code
import numba as nb
import numpy as np
import math
import time
#If the arrays are somewhat larger it makes also sense to parallelize this problem
#cache ==True may also make sense
@nb.njit(fastmath=True)
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
#explicitly declaring the size of the second dim also improves performance a bit
assert x.shape[1]==2
assert y.shape[1]==2
#Works with all dtypes, zeroing isn't necessary
kernel = np.empty((M,N),dtype=x.dtype)
for i in range(M):
for j in range(N):
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
def exampleKernelB(M, x, N, y):
"""Example kernel function A"""
# Euclidean norm function implemented using meshgrid idea.
# Fastest
x0, y0 = np.meshgrid(y[:, 0], x[:, 0])
x1, y1 = np.meshgrid(y[:, 1], x[:, 1])
# Define custom kernel here
kernel = np.sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
return kernel
@nb.njit()
def exampleKernelC(M, x, N, y):
"""Example kernel function A"""
#explicitly declaring the size of the second dim also improves performance a bit
assert x.shape[1]==2
assert y.shape[1]==2
#Works with all dtypes, zeroing isn't necessary
kernel = np.empty((M,N),dtype=x.dtype)
for i in range(M):
for j in range(N):
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
#Your test data
xVec = np.array([
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660],
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660]
])
#compilation on first callable
#can be avoided with cache=True
res=exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
res=exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)
t1=time.time()
for i in range(10_000):
res=exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
t1=time.time()
for i in range(10_000):
res=exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
t1=time.time()
for i in range(10_000):
res=exampleKernelB(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
Performance
exampleKernelA: 0.03s
exampleKernelC: 0.03s
exampleKernelB: 1.02s
Matlab_2016b (your code, but 10000 rep., after few runs): 0.165s
Matlab uses commerical MKL library. If you use free python distribution, check whether you have MKL or other high performance blas library used in python or it is the default ones, which could be much slower.
Upon further investigation I have found that using indices
as indicated in the answer is still slower.
Solution: Use meshgrid
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
# Euclidean norm function implemented using meshgrid idea.
# Fastest
x0, y0 = meshgrid(y[:, 0], x[:, 0])
x1, y1 = meshgrid(y[:, 1], x[:, 1])
# Define custom kernel here
kernel = sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
return kernel
Result: Very very fast, 10 times faster than indices
approach. I am getting times which are closer to C.
However: Using meshgrid
with Matlab
beats C
and Numpy
by being 10 times faster than both.
Still wondering why!