I\'m working with some rather large, dense numpy float arrays that currently reside on disk in PyTables CArrays. I need to be able to perform efficient dot prod
I don't think numpy optimizes dot product for memmap arrays, if you look at the code for matrix multiply, which I got here, you'll see that the function MatrixProduct2 (as currently implemented) computes the values of the result matrix in c memory order:
op = PyArray_DATA(ret); os = PyArray_DESCR(ret)->elsize;
axis = PyArray_NDIM(ap1)-1;
it1 = (PyArrayIterObject *)
PyArray_IterAllButAxis((PyObject *)ap1, &axis);
it2 = (PyArrayIterObject *)
PyArray_IterAllButAxis((PyObject *)ap2, &matchDim);
NPY_BEGIN_THREADS_DESCR(PyArray_DESCR(ap2));
while (it1->index < it1->size) {
while (it2->index < it2->size) {
dot(it1->dataptr, is1, it2->dataptr, is2, op, l, ret);
op += os;
PyArray_ITER_NEXT(it2);
}
PyArray_ITER_NEXT(it1);
PyArray_ITER_RESET(it2);
}
In the above code, op is the return matrix, dot is the 1d dot product function and it1 and it2 are iterators over the input matrices.
That being said, it looks like your code might already be doing the right thing. In this case the optimal performance is actually much better than O(n^3/sprt(M)), you can limit your IO to only reading each item of A once from disk, or O(n). Memmap arrays naturally have to do some caching behind the scene and inner loop operates on it2, so if A is in C-order and the memmap cache is big enough, your code might already be working. You can enforce caching of rows of A explicitly by doing something like:
def my_dot(A, B, C):
for ii in xrange(n):
A_ii = np.array(A[ii, :])
C[ii, :] = A_ii.dot(B)
return C