Optimal HDF5 dataset chunk shape for reading rows

Submitted by 不羁的心 on 2019-12-17 02:49:18

Question


I have a reasonably sized (18 GB compressed) HDF5 dataset and am looking to optimize reading rows for speed. The shape is (639038, 10000). I will be reading a selection of rows (say ~1000 rows) many times, located across the dataset, so I can't use x:(x+1000) to slice rows.

Reading rows from an out-of-memory HDF5 file with h5py is already slow, since I have to pass a sorted list of indices and resort to fancy indexing. Is there a way to avoid fancy indexing, or is there a better chunk shape/size I can use?

I have read rules of thumb such as 1MB-10MB chunk sizes and choosing a chunk shape consistent with what I'm reading. However, building a large number of HDF5 files with different chunk shapes for testing is computationally expensive and very slow.

For each selection of ~1,000 rows, I immediately sum them to get an array of length 10,000. My current dataset looks like this:

'10000': {'chunks': (64, 1000),
          'compression': 'lzf',
          'compression_opts': None,
          'dtype': dtype('float32'),
          'fillvalue': 0.0,
          'maxshape': (None, 10000),
          'shape': (639038, 10000),
          'shuffle': False,
          'size': 2095412704}
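
For reference, the access pattern described above looks roughly like this (a minimal sketch; the file name data.h5 is a placeholder, the dataset key '10000' is taken from the layout above):

import numpy as np
import h5py

with h5py.File('data.h5', 'r') as f:       # placeholder file name
    dset = f['10000']                      # shape (639038, 10000), float32

    # ~1000 row indices scattered across the whole dataset
    rows = np.random.randint(0, dset.shape[0], size=1000)

    # h5py fancy indexing requires sorted (and unique) indices
    rows = np.unique(rows)

    # read the selected rows and immediately sum them to a length-10000 array
    row_sum = dset[rows, :].sum(axis=0)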

What I have tried already:

  • Rewriting the dataset with chunk shape (128, 10000), which I calculate to be ~5MB, is prohibitively slow.
  • I looked at dask.array to optimise, but since ~1,000 rows fit easily in memory I saw no benefit.

Answer 1:


Finding the right chunk cache size

First I want to discuss some general things. It is very important to know that each individual chunk can only be read or written as a whole. The default chunk-cache size of h5py, which can avoid excessive disk I/O, is only 1 MB and should in many cases be increased, which will be discussed later on.

As an example:

  • We have a dataset with shape (639038, 10000), float32 (25.5 GB uncompressed).
  • We want to write our data column-wise, dset[:,i]=arr, and read it row-wise, arr=dset[i,:].
  • We choose a completely wrong chunk shape for this type of work, i.e. (1, 10000).

In this case the reading speed won't be too bad (although the chunk size is a little small), because we read only the data we are using. But what happens when we write to that dataset? If we write a column, one floating-point number of each chunk is written. This means we are actually writing the whole dataset (25.5 GB) with every iteration and reading the whole dataset every other time, because if you modify a chunk you first have to read it if it is not cached (I assume a chunk-cache size below 25.5 GB here).
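
To make these numbers concrete, a quick back-of-the-envelope calculation with the figures above:

# chunk shape (1, 10000), float32 -> one chunk per row
chunk_bytes = 1 * 10000 * 4              # 40 KB per chunk
n_chunks = 639038                        # number of rows = number of chunks

# Writing a single column touches one value in every chunk, so every chunk
# has to be read (if not cached) and rewritten:
touched = n_chunks * chunk_bytes
print(touched / 1e9)                     # ~25.6 GB, essentially the whole dataset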

So what can we improve here? In such a case we have to make a compromise between read/write speed and the memory used by the chunk cache.

A choice which gives both decent read and write speed:

  • We choose a chunk shape of (100, 1000).
  • If we want to write column-wise as above, we need at least 1000*639038*4 bytes ≈ 2.55 GB of cache to avoid the additional I/O overhead described above, while reading row-wise only needs the chunks of one 100-row band, 100*10000*4 bytes ≈ 4 MB.
  • So we should provide at least 2.6 GB of chunk cache in this example. This can easily be done with h5py-cache (https://pypi.python.org/pypi/h5py-cache/1.0); see the sketch after this list.
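
A minimal sketch of setting the cache size (the file name is a placeholder): h5py-cache takes it as chunk_cache_mem_size, and newer h5py versions (2.9 and later) accept the equivalent rdcc_nbytes argument directly, so the extra package is no longer needed there.

import h5py
import h5py_cache as h5c

cache_bytes = 2600 * 1024**2             # ~2.6 GB chunk cache

# via h5py-cache, as in the code below
f = h5c.File('Test.h5', 'r', chunk_cache_mem_size=cache_bytes)
f.close()

# equivalent with plain h5py >= 2.9
f = h5py.File('Test.h5', 'r', rdcc_nbytes=cache_bytes)
f.close()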

Conclusion: There is no generally right chunk size or shape; it depends heavily on the task at hand. Never choose your chunk size or shape without thinking about the chunk cache. RAM is orders of magnitude faster than the fastest SSD when it comes to random reads and writes.

Regarding your problem, I would simply read the random rows; the improper chunk-cache size is your real problem.

Compare the performance of the following code with your version:

import h5py as h5
import time
import numpy as np
import h5py_cache as h5c

def ReadingAndWriting():
    File_Name_HDF5='Test.h5'

    shape = (639038, 10000)
    chunk_shape=(100, 1000)
    Array=np.array(np.random.rand(shape[0]),np.float32)

    #We are using 4GB of chunk_cache_mem here
    f = h5c.File(File_Name_HDF5, 'w',chunk_cache_mem_size=1024**2*4000)
    d = f.create_dataset('Test', shape ,dtype='f',chunks=chunk_shape,compression="lzf")

    #Writing columns
    t1=time.time()
    for i in range(0, shape[1]):
        d[:,i:i+1]=np.expand_dims(Array, 1)

    f.close()
    print(time.time()-t1)

    # Reading random rows
    # If we read one row, 100 rows (the whole chunk band) are actually read, but if
    # we access a row which is already in cache we see a huge speed-up.
    f = h5c.File(File_Name_HDF5,'r',chunk_cache_mem_size=1024**2*4000)
    d = f["Test"]
    for j in range(0, 639):
        t1=time.time()
        # With more iterations it becomes more likely that we hit an already cached row
        inds=np.random.randint(0, high=shape[0]-1, size=1000)
        for i in range(0, inds.shape[0]):
            Array=np.copy(d[inds[i],:])
        print(time.time()-t1)
    f.close()


if __name__ == "__main__":
    ReadingAndWriting()

The simplest form of fancy slicing

I wrote in the comments that I couldn't see this behavior in recent versions. I was wrong. Compare the following:

import h5py as h5
import time
import numpy as np
import h5py_cache as h5c

def Writing():
    File_Name_HDF5='Test.h5'

    shape = (63903, 10000)
    Array=np.array(np.random.rand(shape[0]),np.float32)

    # Writing_1 normal indexing
    f = h5c.File(File_Name_HDF5, 'w',chunk_cache_mem_size=1024**3)
    d = f.create_dataset('Test', shape, dtype='f', chunks=(10000, shape[1] // 50))
    t1=time.time()
    for i in range(0, shape[1]):
        d[:,i:i+1]=np.expand_dims(Array,1)

    f.close()
    print(time.time()-t1)

    # Writing_2 simplest form of fancy indexing 
    f = h5c.File(File_Name_HDF5, 'w',chunk_cache_mem_size=1024**3)
    d = f.create_dataset('Test', shape, dtype='f', chunks=(10000, shape[1] // 50))
    t1=time.time()
    for i in range(0, shape[1]):
        d[:,i]=Array

    f.close()
    print(time.time()-t1)


if __name__ == "__main__":
    Writing()

On my SSD this gives 10.8 seconds for the first version and 55 seconds for the second version.



Source: https://stackoverflow.com/questions/48385256/optimal-hdf5-dataset-chunk-shape-for-reading-rows
