Efficient Way to Create Numpy Arrays from Binary Files

Backend · Open · 4 answers · 1948 views

Asked by 耶瑟儿~ on 2020-12-25 08:45

I have very large datasets that are stored in binary files on the hard disk. Here is an example of the file structure:

File Header

149 Byte

4 Answers
  •  一整个雨季
    2020-12-25 09:28

    One glaring inefficiency is the use of hstack in a loop:

      time_series = hstack((time_series, time_stamp))
      t_series = hstack((t_series, record_t))
      x_series = hstack((x_series, record_x))
      y_series = hstack((y_series, record_y))
      z_series = hstack((z_series, record_z))
    

    On every iteration, this allocates a slightly bigger array for each of the series and copies all the data read so far into it. The total amount of copying therefore grows quadratically with the number of records, and the repeated reallocations can also fragment memory.

    I'd accumulate the values of time_stamp in a list and do one hstack at the end, and would do exactly the same for record_t etc.
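    A minimal sketch of that pattern (the record-reading loop here is illustrative; `records` stands in for whatever iterates over the parsed binary records, which isn't shown in the question):

    ```python
    import numpy as np

    def build_series(records):
        # Accumulate chunks in plain Python lists: appending to a list
        # is amortized O(1), unlike hstack-ing arrays on every iteration.
        time_chunks = []
        t_chunks = []
        for time_stamp, record_t in records:
            time_chunks.append(time_stamp)
            t_chunks.append(record_t)
        # One concatenation per series at the very end.
        time_series = np.hstack(time_chunks)
        t_series = np.hstack(t_chunks)
        return time_series, t_series
    ```

    The same change applies to the x, y and z series; `np.hstack` accepts any sequence of arrays, so the list of chunks can be passed to it directly.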

    If that doesn't bring sufficient performance improvements, I'd comment out the body of the loop and start bringing things back in one at a time, to see where exactly the time is spent.
