I have very large datasets that are stored in binary files on the hard disk. Here is an example of the file structure:
File Header: 149 bytes
One glaring inefficiency is the use of hstack in a loop:
time_series = hstack((time_series, time_stamp))
t_series = hstack((t_series, record_t))
x_series = hstack((x_series, record_x))
y_series = hstack((y_series, record_y))
z_series = hstack((z_series, record_z))
On every iteration this allocates a slightly bigger array for each of the series and copies all the data read so far into it. Since each append copies everything accumulated up to that point, the total copying grows quadratically with the number of records, and it can also lead to bad memory fragmentation.
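You can see the quadratic behaviour in a rough, self-contained timing sketch (the iteration count and element sizes are made up for illustration):

```python
import time
import numpy as np

# Appending one element at a time with hstack: every call allocates a
# new array and copies everything accumulated so far.
series = np.empty(0)
start = time.perf_counter()
for _ in range(20_000):
    series = np.hstack((series, np.zeros(1)))
print(f"20,000 appends via hstack: {time.perf_counter() - start:.2f} s")
# Doubling the iteration count roughly quadruples the time -- the
# hallmark of quadratic copying.
```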
I'd accumulate the values of time_stamp in a list and do one hstack at the end, and would do exactly the same for record_t etc., as sketched below.
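Here is a minimal sketch of that restructuring; read_record() is a hypothetical stand-in for whatever parses one record from the file, and it fabricates dummy data here only so the sketch runs:

```python
import numpy as np

def read_record():
    # Hypothetical placeholder: in the real code this would read and
    # unpack one record from the binary file.
    return (np.random.rand(),      # time_stamp
            np.random.rand(4),     # record_t
            np.random.rand(4),     # record_x
            np.random.rand(4),     # record_y
            np.random.rand(4))     # record_z

time_stamps, t_parts, x_parts, y_parts, z_parts = [], [], [], [], []
for _ in range(10_000):
    time_stamp, record_t, record_x, record_y, record_z = read_record()
    time_stamps.append(time_stamp)
    t_parts.append(record_t)
    x_parts.append(record_x)
    y_parts.append(record_y)
    z_parts.append(record_z)

# One concatenation per series at the very end: a single allocation
# and a single copy for each.
time_series = np.hstack(time_stamps)
t_series = np.hstack(t_parts)
x_series = np.hstack(x_parts)
y_series = np.hstack(y_parts)
z_series = np.hstack(z_parts)
```

Appending to a Python list is amortized constant time, so the loop stays linear; each final hstack then makes exactly one pass over its series.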
If that doesn't bring sufficient performance improvements, I'd comment out the body of the loop and start bringing things back in one at a time, to see where exactly the time is spent.
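For that elimination experiment, a plain wall-clock timer around the loop is enough. The commented-out steps below are hypothetical stand-ins for the real loop body; re-enable them one at a time and re-time:

```python
import time

start = time.perf_counter()
for _ in range(100_000):
    pass
    # raw = f.read(record_size)    # 1. re-enable raw file reads first
    # record_t = parse(raw)        # 2. then parsing/unpacking
    # t_parts.append(record_t)     # 3. then accumulation
print(f"loop as configured: {time.perf_counter() - start:.3f} s")
```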