HDF5 adding numpy arrays slow


Question


This is my first time using HDF5, so could you help me figure out what is wrong and why adding 3D numpy arrays is so slow? Preprocessing a sample takes about 3 s, but writing its 3D numpy array (100x512x512) takes about 30 s, and the time keeps rising with each sample.

First I create the HDF5 file with:

import h5py
import numpy as np

def create_h5(fname_):
  """
  Run only once
  to create the h5 file for the DICOM images.
  """
  f = h5py.File(fname_, 'w', libver='latest')

  # variable-length string dtype for the patient ids
  dtype_ = h5py.special_dtype(vlen=bytes)


  num_samples_train = 1397
  num_samples_test = 1595 - 1397
  num_slices = 100

  f.create_dataset('X_train', (num_samples_train, num_slices, 512, 512), 
    dtype=np.int16, maxshape=(None, None, 512, 512), 
    chunks=True, compression="gzip", compression_opts=4)
  f.create_dataset('y_train', (num_samples_train,), dtype=np.int16, 
    maxshape=(None, ), chunks=True, compression="gzip", compression_opts=4)
  f.create_dataset('i_train', (num_samples_train,), dtype=dtype_, 
    maxshape=(None, ), chunks=True, compression="gzip", compression_opts=4)          
  f.create_dataset('X_test', (num_samples_test, num_slices, 512, 512), 
    dtype=np.int16, maxshape=(None, None, 512, 512), chunks=True, 
    compression="gzip", compression_opts=4)
  f.create_dataset('y_test', (num_samples_test,), dtype=np.int16, maxshape=(None, ), chunks=True, 
    compression="gzip", compression_opts=4)
  f.create_dataset('i_test', (num_samples_test,), dtype=dtype_, 
    maxshape=(None, ), 
    chunks=True, compression="gzip", compression_opts=4)

  f.flush()
  f.close()
  print('HDF5 file created')

Then I run the code that updates the HDF5 file:

import os
import time

import h5py
import pandas as pd

num_samples_train = 1397
num_samples_test = 1595 - 1397

# lbl_fldr, dicom_fldr and h5_fname are defined elsewhere
lbl = pd.read_csv(lbl_fldr + 'stage1_labels.csv')

patients = os.listdir(dicom_fldr)
patients.sort()

f = h5py.File(h5_fname, 'a')  # 'r+' also tried

train_counter = -1
test_counter = -1

for sample in range(0, len(patients)):    

    sw_start = time.time()

    pat_id = patients[sample]
    print('id: %s sample: %d \t train_counter: %d test_counter: %d' %(pat_id, sample, train_counter+1, test_counter+1), flush=True)

    sw_1 = time.time()
    # load_scan, get_pixels_hu and select_slices are my own DICOM
    # preprocessing helpers (not shown here)
    patient = load_scan(dicom_fldr + patients[sample])
    patient_pixels = get_pixels_hu(patient)
    patient_pixels = select_slices(patient_pixels)

    if patient_pixels.shape[0] != 100:
        raise ValueError('Slices != 100: ', patient_pixels.shape[0])



    row = lbl.loc[lbl['id'] == pat_id]

    if row.shape[0] > 1:
        raise ValueError('Found duplicate ids: ', row.shape[0])

    print('Time preprocessing: %0.2f' %(time.time() - sw_1), flush=True)



    sw_2 = time.time()
    #found test patient
    if row.shape[0] == 0:
        test_counter += 1

        f['X_test'][test_counter] = patient_pixels
        f['i_test'][test_counter] = pat_id
        f['y_test'][test_counter] = -1


    #found train
    else: 
        train_counter += 1

        f['X_train'][train_counter] = patient_pixels
        f['i_train'][train_counter] = pat_id
        f['y_train'][train_counter] = row.cancer

    print('Time saving: %0.2f' %(time.time() - sw_2), flush=True)

    sw_el = time.time() - sw_start
    sw_rem = sw_el* (len(patients) - sample)
    print('Elapsed: %0.2fs \t rem: %0.2fm %0.2fh ' %(sw_el, sw_rem/60, sw_rem/3600), flush=True)


f.flush()
f.close()

Answer 1:


The slowness is almost certainly due to the compression and chunking. It's hard to get this right. In my past projects I often had to turn off compression because it was too slow, although I have not given up on the idea of compression in HDF5 in general.

First you should try to confirm that compression and chunking are the cause of the performance issues. Turn off chunking and compression (i.e. leave out the chunks=True, compression="gzip", compression_opts=4 parameters) and try again. I suspect it will be a lot faster.
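As a minimal sketch of that check (same shapes as in your create_h5, but with plain contiguous storage; note that contiguous datasets cannot be resized, so the maxshape arguments have to go as well):

import h5py
import numpy as np

f = h5py.File('test_uncompressed.h5', 'w', libver='latest')

# Contiguous (unchunked, uncompressed) storage for the big dataset.
f.create_dataset('X_train', (1397, 100, 512, 512), dtype=np.int16)
f.create_dataset('X_test', (1595 - 1397, 100, 512, 512), dtype=np.int16)
# ... create the remaining datasets the same way, without the
# chunks/compression keyword arguments, then rerun the fill loop.

f.close()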

If you want to use compression you must understand how chunking works, because HDF compresses the data chunk-by-chunk. Google it, but at least read the section on chunking from the h5py docs. The following quote is crucial:

Chunking has performance implications. It’s recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.

By setting chunks=True you let h5py determine the chunk sizes for you automatically (print the chunks property of the dataset to see what they are). Let's say the chunk size in the first dimension (your sample dimension) is 5. This would mean that when you add one sample, the underlying HDF library will read all the chunks that contain that sample from disk (so in total it will read the 5 samples completely). For every chunk HDF will read it, uncompress it, add the new data, compress it, and write it back to disk. Needless to say, this is slow. This is mitigated by the fact that HDF has a chunk cache, so that uncompressed chunks can reside in memory. However the chunk cache seems to be rather small (see here), so I think all the chunks are swapped in and out of the cache in every iteration of your for-loop. I couldn't find any setting in h5py to alter the chunk cache size.
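To see what h5py picked, something like this should do (a quick sketch, reusing h5_fname from the question):

import h5py

with h5py.File(h5_fname, 'r') as f:
    # a 4-tuple chosen by h5py's auto-chunking heuristic
    print(f['X_train'].chunks)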

You can explicitly set the chunk size by assigning a tuple to the chunks keyword parameter. With all this in mind you can experiment with different chunk sizes. My first experiment would be to set the chunk size in the first (sample) dimension to 1, so that individual samples can be accessed without reading other samples into the cache. Let me know if this helps; I'm curious to know.
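As a sketch of what that could look like for X_train (reusing the names from create_h5; the chunk shape is only one possible choice: with int16 data, (1, 1, 512, 512) is 512 KiB per chunk, which stays within the recommended range and lets one sample be written without touching other samples' chunks):

f.create_dataset('X_train', (num_samples_train, num_slices, 512, 512),
    dtype=np.int16, maxshape=(None, None, 512, 512),
    chunks=(1, 1, 512, 512),          # one sample, one slice per chunk
    compression="gzip", compression_opts=4)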

Even if you find a chunk size that works well for writing the data, it may still be slow when reading, depending on which slices you read. When choosing the chunk size, keep in mind how your application typically reads the data. You may have to adapt your file-creation routines to these chunk sizes (e.g. fill your datasets chunk by chunk). Or you may decide that it's simply not worth the effort and create uncompressed HDF5 files.

Finally, I would set shuffle=True in the create_dataset calls. This may get you a better compression ratio. It shouldn't influence the performance however.




Answer 2:


You have to set up a proper chunk cache size.

You are adding data to an HDF5 dataset many times, which will probably lead to multiple write accesses to the same chunks. If the chunk cache is too small, every such access works like this:

read -> decompress -> add data -> compress -> write

Therefore I recommend setting up a proper chunk cache size (the default is only 1 MB). This can be done with the low-level API or with h5py-cache: https://pypi.python.org/pypi/h5py-cache/1.0

Only the line where you open the HDF5 file has to be changed.
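A sketch of what that could look like; the chunk_cache_mem_size keyword is what the h5py-cache package documents, but treat the exact name (and the alternative h5py keywords, which need h5py >= 2.9) as assumptions to verify against your installed versions:

import h5py_cache

# open the file with a ~1 GiB chunk cache instead of the ~1 MB default
f = h5py_cache.File(h5_fname, 'a', chunk_cache_mem_size=1024**3)

# Alternative with a recent h5py (>= 2.9), no extra package needed:
# f = h5py.File(h5_fname, 'a', rdcc_nbytes=1024**3, rdcc_nslots=1000003)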

Also, the number of dimensions of the numpy array you write should match the number of dimensions of the dataset selection you assign it to.

This:

A = np.random.rand(1000, 1000)
for i in range(0, 200):
    dset[:, :, i:i+1] = A[:, :, np.newaxis]

is 7 times faster on my notebook than this:

A = np.random.rand(1000, 1000)
for i in range(0, 200):
    dset[:, :, i] = A
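For reference, a self-contained version of this comparison might look like the sketch below (the dataset shape, chunking, and compression settings are arbitrary choices for the benchmark, not taken from the answer; absolute timings will vary):

import time

import h5py
import numpy as np

A = np.random.rand(1000, 1000)

with h5py.File('bench.h5', 'w') as f:
    # two identical chunked, compressed datasets to compare against
    kwargs = dict(shape=(1000, 1000, 200), dtype=np.float64,
                  chunks=(100, 100, 10),
                  compression="gzip", compression_opts=4)
    da = f.create_dataset('a', **kwargs)
    db = f.create_dataset('b', **kwargs)

    t0 = time.time()
    for i in range(200):
        da[:, :, i:i+1] = A[:, :, np.newaxis]   # source ndim matches the selection
    print('matched dims: %.2fs' % (time.time() - t0))

    t0 = time.time()
    for i in range(200):
        db[:, :, i] = A                         # 2D source broadcast into a 3D selection
    print('broadcast:    %.2fs' % (time.time() - t0))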


Source: https://stackoverflow.com/questions/41771992/hdf5-adding-numpy-arrays-slow
