I'd like to speed up convolution in Python, and was hoping for some insight on how best to go about it.
I am currently usin
Before moving to, say, C with ctypes, I'd suggest running a standalone convolve in C to see where the limit is.
The same goes for CUDA, Cython, scipy.weave ...
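For a Python-side number to set against such a standalone C measurement, here is a minimal sketch of a 3x3 convolve on 8-bit data with clipping, using only NumPy slicing. The function name `convolve33`, the sharpening kernel, the 1024x1024 test image and the choice to leave border pixels unchanged are all my assumptions for illustration, not taken from any particular implementation.

```python
import time
import numpy as np

def convolve33(image, kernel):
    """3x3 convolve of a 2-D uint8 image, result clipped to 0..255.

    Border pixels are copied from the input unchanged (an assumption;
    pick whatever edge handling your application needs).
    """
    image = np.asarray(image, dtype=np.uint8)
    kernel = np.asarray(kernel, dtype=np.int32)
    h, w = image.shape
    src = image.astype(np.int32)          # widen so sums cannot overflow
    acc = np.zeros((h - 2, w - 2), dtype=np.int32)
    # True convolution flips the kernel; then sum nine shifted views.
    flipped = kernel[::-1, ::-1]
    for dy in range(3):
        for dx in range(3):
            acc += flipped[dy, dx] * src[dy:dy + h - 2, dx:dx + w - 2]
    out = image.copy()
    out[1:-1, 1:-1] = np.clip(acc, 0, 255).astype(np.uint8)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(1024, 1024), dtype=np.uint8)
    sharpen = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]   # hypothetical kernel
    t0 = time.perf_counter()
    convolve33(img, sharpen)
    dt = time.perf_counter() - t0
    print(f"{dt / img.size * 1e9:.1f} ns per point")
```

Multiplying the ns-per-point figure by your clock rate in GHz gives clock cycles per point, which can be compared directly with a C measurement on the same machine.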
Added 7 Feb: convolve33 on 8-bit data with clipping takes ~20 clock cycles per point, 2 clock cycles per memory access, on my Mac G4 PPC with gcc 4.2. Your mileage will vary.
A couple of subtleties:
By the way, googling "theano convolve" turns up: "A convolution op that should mimic scipy.signal.convolve2d, but faster! In development"