I have 3 NumPy arrays and need to form the cartesian product between them. The dimensions of the arrays are not fixed, so they can take different values; one example could be A=
The following produces your expected result without relying on an intermediate three times the size of the result. It uses broadcasting.
Please note that almost any NumPy operation is broadcastable like this, so in practice there is probably no need for an explicit cartesian product:
#shared dimensions:
sh = a.shape[1:]
aba = (a[:, None, None] + b[None, :, None] - a[None, None, :]).reshape(-1, *sh)
aba
#array([[ 0. ,  0.03],
#       [-1. ,  0.16],
#       [ 1. , -0.1 ],
#       [ 0. ,  0.03]])
You may consider leaving out the reshape. That would allow you to address the rows in the result by a combined index. If your component IDs are just 0, 1, 2, ... as in your example, this would be the same as the combined ID. For example, aba[1, 0, 0] would correspond to the row obtained as second row of a + first row of b - first row of a.
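For illustration, here is a minimal sketch of the unreshaped result with small made-up arrays (the values below are assumptions, not the data from the question):
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])    # assumed toy data, 2 rows
b = np.array([[10.0, 20.0]])  # assumed toy data, 1 row

# keep the broadcast result without the final reshape
aba_full = a[:, None, None] + b[None, :, None] - a[None, None, :]

# aba_full[i, j, k] is row i of a, plus row j of b, minus row k of a
print(aba_full[1, 0, 0])                                   # [12. 22.]
print(np.allclose(aba_full[1, 0, 0], a[1] + b[0] - a[0]))  # True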
Broadcasting: When adding two arrays, for example, their shapes do not have to be identical, only compatible, thanks to broadcasting. Broadcasting is in a sense a generalization of adding scalars to arrays:
     [[2],             [[7],    [[2],
7 +   [3],   equiv to   [7],  +  [3],
      [4]]              [7]]     [4]]

Broadcasting:

              [[4],              [[1, 2, 3],    [[4, 4, 4],
[[1, 2, 3]] +  [5],   equiv to    [1, 2, 3],  +  [5, 5, 5],
               [6]]               [1, 2, 3]]     [6, 6, 6]]
For this to work, each dimension of one operand must be either 1 or equal to the corresponding dimension of the other operand; dimensions of size 1 are stretched to match. If an operand has fewer dimensions than the other, its shape is padded with ones on the left. Note that the equiv arrays shown in the illustration are not explicitly created.
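As a small illustration of that rule (the shapes here are arbitrary examples):
import numpy as np

x = np.ones((4, 1, 5))  # shape (4, 1, 5)
y = np.ones((3, 1))     # padded on the left to (1, 3, 1)

# every axis pair is either equal or contains a 1, so the shapes are compatible
print((x + y).shape)    # (4, 3, 5)

# np.broadcast_shapes (NumPy >= 1.20) performs the same check without creating arrays
print(np.broadcast_shapes((4, 1, 5), (3, 1)))  # (4, 3, 5)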
In that case I don't see how you can possibly avoid using storage, so h5py or something like that it is.
This is just a matter of slicing:
a_no_id = a[:, 1:]
etc. Note that, unlike Python lists, NumPy arrays when sliced do not return a copy but a view. Therefore efficiency (memory or runtime) is not an issue here.
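A quick way to convince yourself of the view behaviour (the array contents are arbitrary):
import numpy as np

a = np.arange(6).reshape(3, 2)
a_no_id = a[:, 1:]

print(a_no_id.base is a)  # True -- the slice shares memory with a
a_no_id[0, 0] = 99
print(a[0, 1])            # 99 -- writing through the view changes a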
An alternate solution is to create a cartesian product of indices (which is easier, as solutions for cartesian products of 1D arrays exist):
idx = cartesian_product(
    np.arange(len(a)),
    np.arange(len(b)) + len(a),
    np.arange(len(a))
)
And then use fancy indexing to create the output array:
x = np.concatenate((a, b))
result = x[idx.ravel(), :].reshape(*idx.shape, -1)
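cartesian_product above stands for one of the existing helpers for 1D arrays referenced earlier; a minimal, not particularly optimized sketch could look like this:
import numpy as np

def cartesian_product(*arrays):
    # minimal sketch: every combination of the 1D input arrays, one combination per row
    grids = np.meshgrid(*arrays, indexing='ij')
    return np.stack([g.ravel() for g in grids], axis=-1)

# e.g. with len(a) == 2 and len(b) == 1:
# cartesian_product(np.arange(2), np.arange(1) + 2, np.arange(2))
# -> array([[0, 2, 0],
#           [0, 2, 1],
#           [1, 2, 0],
#           [1, 2, 1]])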
Writing results efficiently on disk
First, a few thoughts on the size of the resulting data.
Size of the result data
size_in_GB = A.shape[0]**2*A.shape[1]*B.shape[0]*(size_of_datatype)/1e9
In your question you mentioned A.shape=(10000,50), B.shape=(40,50). Using float64, your result will be approximately 1600 GB. This can be done without problems if you have enough disk space, but you have to think about what you want to do with the data next. Maybe this is only an intermediate result and processing the data in blocks is possible.
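As a quick sanity check with the shapes mentioned above (plain Python, nothing else assumed):
A_rows, A_cols = 10000, 50   # A.shape from the question
B_rows = 40                  # B.shape[0] from the question
bytes_per_float64 = 8

size_in_GB = A_rows**2 * A_cols * B_rows * bytes_per_float64 / 1e9
print(size_in_GB)  # 1600.0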
If block-wise processing is not an option, here is an example of how to handle 1600 GB of data efficiently (RAM usage will be about 200 MB). The throughput should be around 200 MB/s on realistic data.
The code calculating the results is from @PaulPanzer.
import numpy as np
import tables #register blosc
import h5py as h5
import h5py_cache as h5c
a=np.arange(500*50).reshape(500, 50)
b=np.arange(40*50).reshape(40, 50)
# The Blosc filter options aren't well documented, have a look at https://github.com/Blosc/hdf5-blosc
compression_opts = [0, 0, 0, 0, 5, 1, 1]
compression_opts[4] = 9  # compression level 0...9
compression_opts[5] = 1  # shuffle
compression_opts[6] = 1  # compressor (I guess that's lz4)

File_Name_HDF5 = 'Test.h5'
f = h5c.File(File_Name_HDF5, 'w', chunk_cache_mem_size=1024**2*300)
dset = f.create_dataset('Data', shape=(a.shape[0]**2*b.shape[0], a.shape[1]), dtype='d',
                        chunks=(a.shape[0]*b.shape[0], 1), compression=32001,
                        compression_opts=tuple(compression_opts), shuffle=False)
# Write the data
for i in range(a.shape[0]):
    sh = a.shape[1:]
    aba = (a[i] + b[:, None] - a).reshape(-1, *sh)
    dset[i*a.shape[0]*b.shape[0]:(i+1)*a.shape[0]*b.shape[0]] = aba
f.close()
Reading the data
File_Name_HDF5 = 'Test.h5'
f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2*300)
dset = f['Data']
chunks_size = 500
for i in range(0, dset.shape[0], chunks_size):
    # Iterate over the first dimension
    data = dset[i:i+chunks_size, :]  # avoid excessive calls to the hdf5 library
    # Do something with the data
f.close()
f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2*300)
dset = f['Data']
for i in range(dset.shape[1]):
    # Iterate over the second dimension
    # Fancy indexing, e.g. [:, i], would be much slower;
    # use np.expand_dims, or in this case np.squeeze after the read from the dset,
    # if you want the same result as [:, i] (a 1-dim array)
    data = dset[:, i:i+1]
    # Do something with the data
f.close()
On this test example I get a write throughput of about 550 MB/s, a read throughput of about 500 MB/s along the first dimension and 1000 MB/s along the second dimension, and a compression ratio of 50. NumPy memmap will only provide acceptable speed if you read or write data along the fastest changing direction (in C, the last dimension); with the chunked data format used by HDF5 here, this isn't a problem at all. Compression is also not possible with NumPy memmap, leading to larger file sizes and slower speed.
Please note that the compression filter and chunk shape have to be tuned to your needs. This depends on how you want to read the data afterwards and on the actual data.
If you do something completely wrong, the performance can be 10-100 times slower compared to a proper setup (e.g. the chunk shape can be optimized for the first or the second read example).
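As a rough check after writing, the chunk shape and the achieved compression ratio can be inspected with standard h5py attributes (a minimal sketch, assuming the Test.h5 file created above):
import tables  # registers the blosc filter, needed if you also read the data itself
import h5py as h5

with h5.File('Test.h5', 'r') as f:
    dset = f['Data']
    print(dset.shape)   # logical shape of the dataset
    print(dset.chunks)  # chunk shape chosen at creation time
    # rough compression ratio: uncompressed size / size on disk
    raw_bytes = dset.size * dset.dtype.itemsize
    print(raw_bytes / dset.id.get_storage_size())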