I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right
In my experience, np.save() and np.load() are the fastest way so far to transfer data between hard disk and memory. I relied heavily on databases and HDFS for data loading before I came to this conclusion. My tests show that loading data from a database (hard disk to memory) reaches a bandwidth of around 50 MBps (megabytes per second), while np.load() gets close to my hard disk's maximum bandwidth of 2 GBps (gigabytes per second). Both tests used the simplest possible data structure.
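For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a round-trip timing with np.save() and np.load(). It is not the test I ran: the function name `measure_npy_roundtrip`, the file name `sample.npy`, and the small default shape are all made up for illustration. Note also that a file read right after it was written is usually served from the OS page cache, so the load figure can overstate real disk throughput.

```python
import os
import tempfile
import time

import numpy as np


def measure_npy_roundtrip(shape=(1000, 1000), dtype=np.float64):
    """Save a random array to a .npy file, load it back, and report timings.

    Hypothetical helper for illustration; shape and dtype are assumptions.
    """
    rng = np.random.default_rng(0)
    arr = rng.random(shape).astype(dtype)

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.npy")  # hypothetical file name

        t0 = time.perf_counter()
        np.save(path, arr)
        save_s = time.perf_counter() - t0

        t0 = time.perf_counter()
        loaded = np.load(path)  # reads the whole array into memory
        load_s = time.perf_counter() - t0

    return {
        "nbytes": arr.nbytes,
        "save_MBps": arr.nbytes / save_s / 1e6,
        "load_MBps": arr.nbytes / load_s / 1e6,
        "identical": bool(np.array_equal(arr, loaded)),
    }
```

Calling `measure_npy_roundtrip((1000, 1_000_000))` would approximate the array size from the question (about 8 GB for float64), provided the machine has enough RAM and free disk space.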
And I don't think it's a problem to spend several seconds loading an array with shape (1e3, 1e6). For example, if your array shape is (1000, 1000000) and its data type is float128, then the raw data size is (128/8) * 1000 * 1,000,000 = 16,000,000,000 bytes = 16 GB. If loading it takes 4 seconds, your data loading bandwidth is 16 GB / 4 s = 4 GBps. The SATA3 maximum bandwidth is 600 MBps = 0.6 GBps, so your data loading bandwidth is already more than 6 times that. Your data loading performance can almost compete with DDR's maximum bandwidth, so what else do you want?
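The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper names `array_nbytes` and `bandwidth_gbps` are made up for illustration, and the item size of 16 bytes is passed directly because np.float128 is not available on every platform.

```python
def array_nbytes(shape, itemsize):
    """Raw data size in bytes for an array of the given shape and item size."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


def bandwidth_gbps(nbytes, seconds):
    """Transfer rate in gigabytes per second (1 GB = 1e9 bytes)."""
    return nbytes / seconds / 1e9


# float128 example from the text: 16 bytes per element
size_f128 = array_nbytes((1000, 1_000_000), 16)   # 16_000_000_000 bytes
rate = bandwidth_gbps(size_f128, 4.0)             # 4.0 GBps
sata3_gbps = 0.6                                  # SATA3 figure quoted above
ratio = rate / sata3_gbps                         # a bit more than 6

# For comparison, float64 uses 8 bytes per element
size_f64 = array_nbytes((1000, 1_000_000), 8)     # 8_000_000_000 bytes
```

With float64 (8 bytes per element) the same shape is 8 GB, so the same 4 seconds would correspond to 2 GBps.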
So my final conclusion is:
Don't use Python's pickle, don't use a database, and don't use a big data system to store your data on hard disk if you can use np.save() and np.load(). These two functions are the fastest way I have found so far to transfer data between hard disk and memory.
I've also tested HDF5 and found it to be much slower than np.load() and np.save(), so use np.save() and np.load() if you have enough DDR memory on your platform.