Question
After receiving the warning
H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead.
I changed my code to:
import h5py
import numpy as np
f = h5py.File('myfile.hdf5', mode='r')
foo = f['foo']
bar = f['bar']
N, C, H, W = foo.shape  # (8192, 3, 1080, 1920)
data_foo = np.array(foo[()]) # [()] equivalent to .value
and when I tried to read a (not so) big file of images, I got Killed: 9 in my terminal: my process was killed on the last line of the code because it was consuming too much memory, despite that archaic comment of mine there.
However, my original code:
f = h5py.File('myfile.hdf5', mode='r')
data_foo = f.get('foo').value
# script's logic after that worked, process not killed
worked just fine, apart from the issued warning.
Why did my code work?
Answer 1:
Let me explain what your code is doing, and why you are getting memory errors. First, some HDF5/h5py basics. (The h5py docs are an excellent starting point; see the h5py QuickStart guide.)
Both foo = f['foo'] and foo = f.get('foo') return an h5py dataset object named 'foo'. (Note: it's more common to see this written as foo = f['foo'], but there is nothing wrong with the get() method.) A dataset object is not the same as a NumPy array. Datasets behave like NumPy arrays: both have a shape and a data type, and support array-style slicing. However, when you access a dataset object, you do not read all of the data into memory. As a result, they require less memory to access. This is important when working with large datasets!
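To make the distinction concrete, here is a minimal sketch (the file name and shapes are the ones from the question):
import h5py
f = h5py.File('myfile.hdf5', mode='r')
foo = f['foo']               # h5py dataset object; no image data is read yet
print(foo.shape, foo.dtype)  # (8192, 3, 1080, 1920) -- metadata only
first_image = foo[0]         # reads just one (3, 1080, 1920) image into memory
f.close()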
This statement returns a NumPy array: data_foo = f.get('foo').value. The preferred method is data_foo = f['foo'][:]. (NumPy slicing notation is the way to return a NumPy array from a dataset object. As you discovered, .value is deprecated.)
This also returns a NumPy array: data_foo = foo[()] (assuming foo is defined as above).
So, when you execute the statement data_foo = np.array(foo[()]), you are creating a new NumPy array from another array (foo[()] is the input object): foo[()] first reads the entire dataset into one NumPy array, and np.array() then allocates a second copy on top of it. I suspect your process was killed because the memory needed to hold a copy of a (8192, 3, 1080, 1920) array in addition to the original exceeded your system resources. That statement will work for small datasets/arrays, but it's not good practice.
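For scale: a (8192, 3, 1080, 1920) array has about 5.1e10 elements, roughly 47 GiB even at one byte per element, so holding two copies at once is fatal on most machines. If the full array does not fit in memory even once, a common pattern (a sketch only; process_image is a hypothetical stand-in for your script's per-image logic) is to iterate over the dataset so that only one slice is resident at a time:
import h5py

def process_image(img):
    # hypothetical placeholder for the real per-image logic
    return img.mean()

results = []
f = h5py.File('myfile.hdf5', mode='r')
foo = f['foo']
for i in range(foo.shape[0]):
    img = foo[i]  # reads one (3, 1080, 1920) image; the previous one can be freed
    results.append(process_image(img))
f.close()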
Here's an example to show how to use the different methods (h5py dataset object vs NumPy array).
import h5py

h5f = h5py.File('myfile.hdf5', mode='r')
# This returns an h5py dataset object:
foo_ds = h5f['foo']
# You can slice to get elements like this:
foo_slice1 = foo_ds[0,:,:,:]  # first image (first entry along axis 0)
foo_slice2 = foo_ds[-1,:,:,:] # last image
# This is the recommended method to get a NumPy array of the entire dataset:
foo_arr = h5f['foo'][:]
# or, referencing the h5py dataset object above:
foo_arr = foo_ds[:]
# You can also create an array from a slice:
foo_slice1 = h5f['foo'][0,:,:,:]
# which is the same as (from above):
foo_slice1 = foo_ds[0,:,:,:]
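One more practice worth noting: opening the file with a context manager ensures it is always closed, even if your processing code raises an exception. A small sketch using the same file name (just remember to read the data you need inside the with block):
with h5py.File('myfile.hdf5', mode='r') as h5f:
    foo_slice1 = h5f['foo'][0,:,:,:]  # read while the file is still open
# h5f is closed here; dataset objects from it can no longer be read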
Source: https://stackoverflow.com/questions/61464832/why-can-i-process-a-large-file-only-when-i-dont-fix-hdf5-deprecation-warning