Efficient Way to Create Numpy Arrays from Binary Files

耶瑟儿~  2020-12-25 08:45

I have very large datasets that are stored in binary files on the hard disk. Here is an example of the file structure:

File Header: 149 Bytes
4 Answers
  • 2020-12-25 09:08

    I have got satisfactory results with a similar problem (multi-resolution, multi-channel binary data files) by using array and struct.unpack. In my problem, I wanted continuous data for each channel, but the file had an interval-oriented structure instead of a channel-oriented structure.

    The "secret" is to read the whole file first, and only then distribute the known-sized slices to the desired containers (in the code below, self.channel_content[channel]['recording'] is an object of type array):

    from array import array
    import os

    f = open(somefilename, 'rb')
    fullsamples = array('h')
    # Read every remaining 16-bit sample from the file in one call
    fullsamples.fromfile(f, (os.path.getsize(somefilename) - f.tell()) // 2)
    position = 0
    for rec in range(int(self.header['nrecs'])):
        for channel in self.channel_labels:
            samples = int(self.channel_content[channel]['nsamples'])
            self.channel_content[channel]['recording'].extend(
                fullsamples[position:position + samples])
            position += samples
    

    Of course, I cannot state this is better or faster than other answers provided, but at least it is something you might evaluate.

    Hope it helps!

  • 2020-12-25 09:25

    Numpy supports mapping binary data from files directly into array-like objects via numpy.memmap. You might be able to memmap the file and extract the data you need via offsets; a rough sketch follows the endianness snippet below.

    For endianness correctness, just call the array's byteswap method on what you have read in. You can use a conditional expression to check the endianness of the host system:

    import struct
    import numpy as np

    if struct.pack('=f', np.pi) == struct.pack('>f', np.pi):
        # Host is big-endian: swap the bytes of the little-endian file data in place
        arrayName.byteswap(True)
    
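    As a rough illustration of the memmap idea, here is a minimal sketch. It assumes the 149-byte header from the question and borrows the record layout and the names input_file, sample_rate and number_of_records from the fromfile answer below; the placeholder values are made up:

    import numpy as np

    # Placeholder values standing in for the question's variables (assumptions)
    input_file = 'data.bin'        # path to the binary file
    sample_rate = 1024             # samples per record per channel
    number_of_records = 100        # records stored after the header

    # Record layout borrowed from the fromfile answer below
    record_dtype = np.dtype([
        ('timestamp', '<i4'),
        ('samples', '<i2', (sample_rate, 4))
    ])

    # Map the records that follow the 149-byte header; nothing is read into RAM yet
    records = np.memmap(input_file, dtype=record_dtype, mode='r',
                        offset=149, shape=(number_of_records,))

    time_series = records['timestamp']                # still a lazy view into the file
    x_series = records['samples'][:, :, 1].ravel()    # this slice is copied into memory

    Because the mapping is lazy, only the pages you actually touch are read from disk, which keeps very large files cheap to open.
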
  • 2020-12-25 09:26

    Some hints:

    • Don't use the struct module. Instead, use Numpy's structured data types and fromfile. Check here: http://scipy-lectures.github.com/advanced/advanced_numpy/index.html#example-reading-wav-files

    • You can read all of the records at once by passing a suitable count= to fromfile.

    Something like this (untested, but you get the idea):

    import numpy as np

    with open(input_file, 'rb') as f:
        header = f.read(149)

        # ... parse the header as you did ...

        record_dtype = np.dtype([
            ('timestamp', '<i4'),
            ('samples', '<i2', (sample_rate, 4))
        ])

        data = np.fromfile(f, dtype=record_dtype, count=number_of_records)
        # NB: count can be omitted -- it then reads the rest of the file

    time_series = data['timestamp']
    t_series = data['samples'][:, :, 0].ravel()
    x_series = data['samples'][:, :, 1].ravel()
    y_series = data['samples'][:, :, 2].ravel()
    z_series = data['samples'][:, :, 3].ravel()
    
  • 2020-12-25 09:28

    One glaring inefficiency is the use of hstack in a loop:

      time_series = hstack ( ( time_series , time_stamp ) )
      t_series = hstack ( ( t_series , record_t ) )
      x_series = hstack ( ( x_series , record_x ) )
      y_series = hstack ( ( y_series , record_y ) )
      z_series = hstack ( ( z_series , record_z ) )
    

    On every iteration, this allocates a slightly bigger array for each of the series and copies all the data read so far into it. This involves lots of unnecessary copying and can potentially lead to bad memory fragmentation.

    I'd accumulate the values of time_stamp in a list and do one hstack at the end, and would do exactly the same for record_t etc.
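
    A minimal sketch of that accumulate-then-concatenate pattern; parse_record_t and number_of_records are placeholders standing in for the question's existing per-record parsing:

    import numpy as np

    t_chunks = []
    for rec in range(number_of_records):      # number_of_records comes from the header (placeholder)
        record_t = parse_record_t(rec)        # placeholder for the existing per-record parsing
        t_chunks.append(record_t)             # list append is cheap; no array copying here

    # One allocation and one copy at the very end, instead of one per iteration
    t_series = np.hstack(t_chunks)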

    If that doesn't bring sufficient performance improvements, I'd comment out the body of the loop and start bringing things back in one at a time, to see where exactly the time is spent.
