I have very large datasets that are stored in binary files on the hard disk. Here is an example of the file structure:
File Header: 149 bytes
I have got satisfactory results with a similar problem (multi-resolution, multi-channel binary data files) by using array and struct.unpack. In my problem, I wanted continuous data for each channel, but the file had an interval-oriented structure instead of a channel-oriented structure.
The "secret" is to read the whole file first, and only then distribute the known-sized slices to the desired containers (in the code below, self.channel_content[channel]['recording'] is an object of type array):
import os
from array import array

f = open(somefilename, 'rb')
fullsamples = array('h')
fullsamples.fromfile(f, (os.path.getsize(somefilename) - f.tell()) // 2)
position = 0
for rec in range(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(fullsamples[position:position + samples])
        position += samples
Of course, I cannot state this is better or faster than the other answers provided, but at least it is something you might evaluate.
Hope it helps!
Numpy supports mapping binary data from files directly into array-like objects via numpy.memmap. You might be able to memmap the file and extract the data you need via offsets.
For endianness correctness, just use numpy.byteswap on what you have read in. You can use a conditional expression to check the endianness of the host system:
import struct
import numpy as np

if struct.pack('=f', np.pi) == struct.pack('>f', np.pi):
    # Host is big-endian: convert in place
    arrayName.byteswap(inplace=True)
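As a minimal sketch of the memmap approach: the file layout below (a 149-byte header followed by little-endian int16 samples) is an assumption for illustration, not the asker's exact format. The point is that numpy.memmap skips the header via offset= and lets you slice the data region without reading the whole file:

```python
import os
import tempfile

import numpy as np

# Build a toy file: 149-byte header, then little-endian int16 samples.
# (Layout is assumed for this example only.)
header = b"\x00" * 149
samples = np.arange(1000, dtype="<i2")

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(header)
    f.write(samples.tobytes())
    path = f.name

# Map the data region directly; nothing is loaded until you slice it.
mm = np.memmap(path, dtype="<i2", mode="r", offset=149)

# Extract a slice via offsets -- only the touched pages are read.
chunk = np.array(mm[100:110])

del mm  # release the mapping before deleting the file
os.remove(path)
```

Slicing the memmap returns views into the mapped file, so copying into a regular array (as above) is only needed if you want the data to outlive the mapping.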
Some hints:
Don't use the struct module. Instead, use Numpy's structured data types and fromfile. Check here: http://scipy-lectures.github.com/advanced/advanced_numpy/index.html#example-reading-wav-files
You can read all of the records at once, by passing in a suitable count= to fromfile.
Something like this (untested, but you get the idea):
import numpy as np

file = open(input_file, 'rb')
header = file.read(149)
# ... parse the header as you did ...

record_dtype = np.dtype([
    ('timestamp', '<i4'),
    ('samples', '<i2', (sample_rate, 4))
])

data = np.fromfile(file, dtype=record_dtype, count=number_of_records)
# NB: count can be omitted -- it just reads the whole file then

time_series = data['timestamp']
t_series = data['samples'][:, :, 0].ravel()
x_series = data['samples'][:, :, 1].ravel()
y_series = data['samples'][:, :, 2].ravel()
z_series = data['samples'][:, :, 3].ravel()
One glaring inefficiency is the use of hstack in a loop:
time_series = hstack((time_series, time_stamp))
t_series = hstack((t_series, record_t))
x_series = hstack((x_series, record_x))
y_series = hstack((y_series, record_y))
z_series = hstack((z_series, record_z))
On every iteration, this allocates a slightly bigger array for each of the series and copies all the data read so far into it. This involves lots of unnecessary copying and can potentially lead to bad memory fragmentation.
I'd accumulate the values of time_stamp in a list and do one hstack at the end, and would do exactly the same for record_t etc.
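The suggested pattern can be sketched like this (the record contents here are made up for illustration; in practice each chunk would come from the file):

```python
import numpy as np

timestamps = []
t_chunks = []

# Pretend we read 5 records, each with one timestamp and 4 samples.
for rec in range(5):
    timestamps.append(np.full(1, rec, dtype=np.int32))
    t_chunks.append(np.arange(4, dtype=np.int16))

# One concatenation at the end, instead of one reallocation per iteration.
time_series = np.concatenate(timestamps)
t_series = np.concatenate(t_chunks)
```

Appending to a Python list is amortized O(1), so the total cost is one final copy rather than a quadratic series of ever-larger copies.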
If that doesn't bring sufficient performance improvements, I'd comment out the body of the loop and start bringing things back in one at a time, to see where exactly the time is spent.