I have two one-dimensional NumPy vectors, va and vb, which are used to populate a matrix by passing every pair of elements to a function.
cdist is fast because it is written in highly optimized C code (as you already pointed out) and because its fast path only covers a small, predefined set of metrics.
Since you want to apply the operation generically, to any given foo function, you have no choice but to call that function na × nb times. That part is unlikely to be optimizable any further.
What's left to optimize are the loops and the indexing. Some suggestions to try out:
- xrange instead of range (if you are on Python 2.x; in Python 3, range is already lazy, like a generator)
- enumerate, instead of range plus explicit indexing
- cython or numba, to speed up the looping process

If you can make further assumptions about foo, it might be possible to speed it up further.
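As a rough sketch of the enumerate suggestion, and of what one "further assumption" could buy you, the snippet below compares a plain double loop with a broadcast call. The foo here is a made-up stand-in (not the function from the question), and the names pairwise_loop and pairwise_broadcast are hypothetical. The broadcast version only applies if foo happens to work elementwise on NumPy arrays, which is an assumption you would have to check for your own function.

```python
import numpy as np


def foo(a, b):
    # Hypothetical stand-in for the pairwise function; not from the question.
    return a * a + 3.0 * b


def pairwise_loop(va, vb, func):
    """Call func once per (a, b) pair, using enumerate instead of index lookups."""
    out = np.empty((len(va), len(vb)))
    for i, a in enumerate(va):
        for j, b in enumerate(vb):
            out[i, j] = func(a, b)
    return out


def pairwise_broadcast(va, vb, func):
    """Call func once on broadcast views.

    Assumption: func works elementwise on arrays. If it does not,
    this will fail or give wrong results, and the loop is the fallback.
    """
    return func(va[:, None], vb[None, :])


va = np.array([1.0, 2.0, 3.0])
vb = np.array([10.0, 20.0])

looped = pairwise_loop(va, vb, foo)          # na * nb Python-level calls
vectorized = pairwise_broadcast(va, vb, foo)  # one call, looping done in C
```

The loop version still makes na × nb calls to foo, so it only trims indexing overhead. The broadcast version replaces all of those calls with a single one, which is where most of the speedup would come from when the assumption holds.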