Why can I process a large file only when I don't fix the HDF5 deprecation warning?

Posted by 你。 on 2020-05-17 08:46:06

Question


After receiving the warning "H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead.", I changed my code to:

import h5py
import numpy as np 

f = h5py.File('myfile.hdf5', mode='r')
foo = f['foo']
bar = f['bar']
N, C, H, W = foo.shape  # (8192, 3, 1080, 1920)
data_foo = np.array(foo[()]) # [()] equivalent to .value

When I tried to read a (not so) big file of images, my process was killed with Killed: 9 on my terminal because it was consuming too much memory, on the last line of the code, despite that archaic comment of mine there.
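For scale, here is a quick back-of-the-envelope estimate of that array's size (the dtype isn't stated above; uint8 is assumed here, and float32 would be four times larger):

import numpy as np

shape = (8192, 3, 1080, 1920)
n_bytes = np.prod(shape, dtype=np.int64) * np.dtype('uint8').itemsize  # uint8 is an assumption
print(f"{n_bytes / 2**30:.1f} GiB")  # ~47.5 GiB for uint8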

However, my original code:

f = h5py.File('myfile.hdf5', mode='r')
data_foo = f.get('foo').value
# script's logic after that worked, process not killed

worked just fine, apart from the issued warning.

Why did my code work?


Answer 1:


Let me explain what your code is doing, and why you are getting memory errors. First, some HDF5/h5py basics. (The h5py docs are an excellent starting point; check the h5py QuickStart guide.)

foo = f['foo'] and foo = f.get('foo') both return an h5py dataset object named 'foo'. (Note: it's more common to see this as foo = f['foo'], but there is nothing wrong with the get() method.) A dataset object is not the same as a NumPy array. Datasets behave like NumPy arrays: both have a shape and a data type, and support array-style slicing. However, creating a dataset object does not read any data into memory; data is only read when you slice the dataset. As a result, dataset objects require very little memory to access. This is important when working with large datasets!
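A minimal sketch of that behavior, reusing the file and dataset names from the question:

import h5py

with h5py.File('myfile.hdf5', mode='r') as f:
    ds = f['foo']              # h5py dataset object; no image data read yet
    print(ds.shape, ds.dtype)  # metadata only, essentially free
    one_image = ds[0]          # reads just one (3, 1080, 1920) image into memory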

This statement returns a NumPy array: data_foo = f.get('foo').value. The preferred method is data_foo = f['foo'][:]. (NumPy slicing notation is the way to return a NumPy array from a dataset object. As you discovered, .value is deprecated.)
This also returns a NumPy array: data_foo = foo[()] (assuming foo is defined as above).
So, when you execute the statement data_foo = np.array(foo[()]), you are creating a new NumPy array from another array: foo[()] first reads the entire dataset into memory, and np.array() then copies it, so for a moment you need roughly twice the dataset's size in RAM. I suspect your process was killed because the amount of memory needed to create a copy of an (8192, 3, 1080, 1920) array exceeded your system resources. That statement will work for small datasets/arrays. However, it's not good practice.
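If the full array does not fit in memory at all, a common pattern is to read and process the dataset in slices, so only one batch is resident at a time. A minimal sketch (batch_size and the per-batch processing are placeholders, not part of the original question):

import h5py

batch_size = 64  # assumption: tune to your available memory

with h5py.File('myfile.hdf5', mode='r') as f:
    foo_ds = f['foo']
    n = foo_ds.shape[0]
    for start in range(0, n, batch_size):
        batch = foo_ds[start:start + batch_size]  # only this slice is read into RAM
        # ... process `batch` here (a NumPy array of up to batch_size images) ...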

Here's an example to show how to use the different methods (h5py dataset object vs NumPy array).

h5f = h5py.File('myfile.hdf5', mode='r')

# This returns an h5py dataset object (no data is read yet):
foo_ds = h5f['foo']
# You can slice to get elements like this (each slice is read as a NumPy array):
foo_slice1 = foo_ds[0,:,:,:]  # first image, shape (3, 1080, 1920)
foo_slice2 = foo_ds[-1,:,:,:] # last image

# This is the recommended method to get a NumPy array of the entire dataset:
foo_arr = h5f['foo'][:]
# or, referencing the h5py dataset object above:
foo_arr = foo_ds[:]

# You can also create an array with a slice:
foo_slice1 = h5f['foo'][0,:,:,:]
# which is the same as (from above):
foo_slice1 = foo_ds[0,:,:,:]
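One housekeeping note: the example above leaves h5f open. Call h5f.close() when you are done, or open the file with a with h5py.File('myfile.hdf5', mode='r') as h5f: block (as in the sketches above) so it is closed automatically even if an error occurs.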


Source: https://stackoverflow.com/questions/61464832/why-can-i-process-a-large-file-only-when-i-dont-fix-hdf5-deprecation-warning
