inconsistent record marker while reading fortran unformatted file

佛祖请我去吃肉 2020-12-10 22:49

I'm trying to read a very big Fortran unformatted binary file with Python. This file contains 2^30 integers.

I find that the record markers are confusing (the first

3 Answers
  • 2020-12-10 22:58

    Since this question seems to come up often, here is a Python utility that scans a binary file and determines whether it is (or might be) a Fortran unformatted sequential-access file. It works by trying several header formats. Since the "unformatted" format isn't standardized, there could be other variants, but this should hit the most common ones.


    
    import struct


    def scanfbinary(hformat, f, fsize):
        """Scan a file to see if it has the simple structure typical of
        an unformatted sequential-access Fortran binary:
        recl1,<data of length recl1 bytes>,recl1,recl2,<data of length recl2 bytes>,recl2 ...
        Returns 0 on success, or a nonzero code at the first inconsistency.
        """
        # Header width follows from the struct type code: q/Q = 8 bytes, i/I/l/L = 4 bytes.
        hsize = 8 if hformat[1] in 'qQ' else 4
        endian = 'little' if hformat[0] == '<' else 'big'
        print('scan type', hformat, '(', endian, 'endian', hsize, 'byte header)')
        f.seek(0)
        nrec = 0
        while fsize > 0:
            raw = f.read(hsize)
            if len(raw) < hsize:
                print('truncated header')
                return 2
            h0 = struct.unpack(hformat, raw)[0]
            if h0 < 0:
                print('invalid integer', h0)
                return 1
            if h0 > fsize - 2 * hsize:
                print('invalid header size', h0, 'exceeds file size', fsize)
                if nrec > 0:
                    print('odd, perhaps a corrupt file?')
                return 2
            # To read the data, replace the f.seek() below with code that reads h0 bytes, e.g.
            #   import numpy
            #   dtype = numpy.dtype('<i')
            #   record = numpy.fromfile(f, dtype, h0 // dtype.itemsize)
            f.seek(h0, 1)
            h = struct.unpack(hformat, f.read(hsize))[0]
            if h0 != h:
                print('unmatched header')
                return 3
            nrec += 1
            if nrec < 10:
                print('read record', nrec, 'size', h)
            fsize -= h + 2 * hsize
        print('successfully read', nrec, 'records with unformatted fortran header type', hformat)
        return 0


    # Open in binary mode, measure the file, then try each candidate header format.
    with open('binaryfilename', 'rb') as f:
        f.seek(0, 2)
        fsize = f.tell()
        res = [scanfbinary(hformat, f, fsize) for hformat in ('<q', '>q', '<i', '>i')]
        if res.count(0) == 0:
            print('no match found, file size', fsize, 'starts..')
            f.seek(0)
            print(' '.join(f.read(2).hex() for _ in range(12)))
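
    To try a scanner like the one above without a real Fortran program, it can help to build a small file with the same layout by hand. The following is a minimal sketch, assuming 4-byte little-endian markers (one of the layouts the scanner tries). The helper names write_record and iter_records are made up for this illustration and do not come from any library.

```python
import io
import struct


def write_record(stream, payload, hformat="<i"):
    """Write one record: length marker, payload bytes, same length marker again."""
    marker = struct.pack(hformat, len(payload))
    stream.write(marker + payload + marker)


def iter_records(stream, hformat="<i"):
    """Yield the payload of each record, checking that both markers agree."""
    hsize = struct.calcsize(hformat)
    while True:
        raw = stream.read(hsize)
        if not raw:          # clean end of file
            return
        (head,) = struct.unpack(hformat, raw)
        payload = stream.read(head)
        (tail,) = struct.unpack(hformat, stream.read(hsize))
        if head != tail:
            raise ValueError(f"unmatched markers: {head} != {tail}")
        yield payload


# Build a two-record file in memory: three int32 values, then two float64 values.
buf = io.BytesIO()
write_record(buf, struct.pack("<3i", 1, 2, 3))
write_record(buf, struct.pack("<2d", 0.5, 1.5))
buf.seek(0)

records = list(iter_records(buf))
print([len(r) for r in records])          # [12, 16]
print(struct.unpack("<3i", records[0]))   # (1, 2, 3)
```

    Writing buf.getvalue() to disk gives a file that the '<i' scan should accept, while the other three header formats should be rejected.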
    
    
  • 2020-12-10 23:02

    Finally, things seem to be more clear.

    The Intel Fortran Compiler User and Reference Guides explain this in the section Record Types: Variable-Length Records:

    For a record length greater than 2,147,483,639 bytes, the record is divided into subrecords. The subrecord can be of any length from 1 to 2,147,483,639, inclusive.

    The sign bit of the leading length field indicates whether the record is continued or not. The sign bit of the trailing length field indicates the presence of a preceding subrecord. The position of the sign bit is determined by the endian format of the file.

    A subrecord that is continued has a leading length field with a sign bit value of 1. The last subrecord that makes up a record has a leading length field with a sign bit value of 0. A subrecord that has a preceding subrecord has a trailing length field with a sign bit value of 1. The first subrecord that makes up a record has a trailing length field with a sign bit value of 0. If the value of the sign bit is 1, the length of the record is stored in twos-complement notation.

    After many attempts, I realized that I had been misled by the phrase "twos-complement notation": read as a signed 32-bit integer, the record marker is simply the length with its sign flipped according to the rules above, so no further conversion is needed when the sign bit is 1. It is also possible that my data was created with a different compiler.
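
    The sign rules can be sketched in a few lines of Python. This is only an illustration under the assumption that markers are 4-byte big-endian signed integers, as in my file; decode_markers is a made-up helper name.

```python
import struct


def decode_markers(prefix, suffix):
    """Interpret a subrecord's leading and trailing markers (signed integers).

    Returns (length, continued, has_preceding) following the sign rules above.
    """
    if abs(prefix) != abs(suffix):
        raise ValueError(f"marker mismatch: {prefix} vs {suffix}")
    length = abs(prefix)
    continued = prefix < 0       # leading sign bit set: another subrecord follows
    has_preceding = suffix < 0   # trailing sign bit set: a subrecord came before
    return length, continued, has_preceding


# The raw bytes of a big-endian marker with the sign bit set decode directly
# to a negative integer, so no extra conversion is needed.
(value,) = struct.unpack(">i", bytes([0x80, 0x00, 0x00, 0x09]))
print(value)                                     # -2147483639
print(decode_markers(-2147483639, 2147483639))   # (2147483639, True, False)
print(decode_markers(18, -18))                   # (18, False, True)
```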

    Below is the solution.

    The data is larger than 2 GB, so it is divided into several subrecords. The start marker of the first subrecord is -2147483639, so its length is 2147483639, which is exactly the maximum subrecord length: not 2147483640 as I first thought, nor 2147483638, which I had taken to be the twos-complement reading of -2147483639.

    If we skip 2147483639 bytes and read the end marker, we get 2147483639. It is positive because this is the first subrecord, which has no preceding subrecord.

    Below is the code to check the record markers:

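    import struct

    # file_path is the path to the unformatted file (big-endian, 4-byte markers)
    with open(file_path, "rb") as fp:
        while True:
            prefix, = struct.unpack('>i', fp.read(4))
            fp.seek(abs(prefix), 1)    # or read abs(prefix) bytes of data as needed
            suffix, = struct.unpack('>i', fp.read(4))
            print(prefix, suffix)
            if abs(suffix) != abs(prefix):
                print("suffix != prefix!")
                break
            if prefix > 0:             # a positive prefix marks the last subrecord
                break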
    

    And the screen prints:

    -2147483639 2147483639
    -2147483639 -2147483639
    18 -18
    

    We can see that the begin and end markers of each subrecord always have the same magnitude and can differ only in sign. The lengths of the three subrecords are 2147483639, 2147483639 and 18 bytes, which need not be multiples of 4. So the first subrecord ends with the first 3 bytes of some integer, and the second subrecord begins with the remaining 1 byte.
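
    As a check on the lengths, 2147483639 + 2147483639 + 18 = 4294967296 = 4 × 2^30 bytes, which matches 2^30 four-byte integers. Because integers can straddle subrecord boundaries, the payloads have to be joined before decoding. Here is a rough sketch of that idea on a tiny in-memory example, assuming the same big-endian 4-byte markers. The names read_logical_record and write_subrecords are invented for this illustration, and the 7-byte subrecord limit only stands in for the real 2147483639-byte limit.

```python
import io
import struct


def read_logical_record(stream, hformat=">i"):
    """Read one logical record, joining subrecords until a non-negative prefix is seen."""
    hsize = struct.calcsize(hformat)
    chunks = []
    while True:
        (prefix,) = struct.unpack(hformat, stream.read(hsize))
        chunks.append(stream.read(abs(prefix)))
        (suffix,) = struct.unpack(hformat, stream.read(hsize))
        if abs(prefix) != abs(suffix):
            raise ValueError(f"marker mismatch: {prefix} vs {suffix}")
        if prefix >= 0:          # last subrecord of this record
            return b"".join(chunks)


def write_subrecords(stream, payload, max_len, hformat=">i"):
    """Test helper: split payload into subrecords of at most max_len bytes."""
    pieces = [payload[i:i + max_len] for i in range(0, len(payload), max_len)]
    for k, piece in enumerate(pieces):
        lead = -len(piece) if k < len(pieces) - 1 else len(piece)   # negative: continued
        trail = -len(piece) if k > 0 else len(piece)                # negative: has predecessor
        stream.write(struct.pack(hformat, lead) + piece + struct.pack(hformat, trail))


# Five big-endian int32 values (20 bytes) split into subrecords of 7 + 7 + 6 bytes,
# so integers straddle subrecord boundaries just as in the 2 GB case.
data = struct.pack(">5i", 10, 20, 30, 40, 50)
buf = io.BytesIO()
write_subrecords(buf, data, max_len=7)
buf.seek(0)
values = struct.unpack(">5i", read_logical_record(buf))
print(values)   # (10, 20, 30, 40, 50)
```

    For a multi-gigabyte record, joining the chunks in memory is expensive, so reading each subrecord into a preallocated buffer would probably be a better fit.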

  • 2020-12-10 23:15

    I found that using f2py is a more convenient way for Python to access Fortran data, although the strange behavior of the record marks remains an open question. At least it lets us avoid diving into the (sometimes confusing) structure of Fortran unformatted files, and it works well with NumPy.

    The F2PY Users Guide and Reference Manual documents the tool. Here is an example Fortran source file for opening and closing a file and for reading a 1-D integer array and a 2-D float array. Note the comments beginning with !f2py: they help f2py generate a better wrapper.

    To use it, compile it into a module with f2py and import it into a Python session. You can then call these subroutines like ordinary Python functions.

    !ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
    !cc                                                         cc
    !cc      FORTRAN MODULE for PYTHON PROGRAM CALLING          cc
    !cc                                                         cc
    !ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
    
    !Usage: 
    !   Compile:   f2py -c fortio.f90 -m fortio
    !   Import:    from fortio import *
    !       or     import fortio
    !Note:
    !   Big endian: 1; Little endian: 0
    
    
    !cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
    SUBROUTINE open_fortran_file(fileUnit, fileName, endian, error)
      implicit none
    
      character(len=256) :: fileName
      integer*4 :: fileUnit, error, endian
      !f2py integer*4 optional, intent(in) :: endian=1
      !f2py integer*4 intent(out) :: error
    
      if(endian .NE. 0) then
         open(unit=fileUnit, FILE=fileName, form='unformatted', status='old', &
              iostat=error, convert='big_endian')
      else
         open(unit=fileUnit, FILE=fileName, form='unformatted', status='old', &
              iostat=error)
      endif
    END SUBROUTINE 
    
    !cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
    SUBROUTINE read_fortran_integer4(fileUnit, arr, leng)
      implicit none
    
      integer*4 :: fileUnit, leng
      integer*4 :: arr(leng)
      !f2py integer*4 intent(in) :: fileUnit, leng 
      !f2py integer*4 intent(out), dimension(leng), depend(leng) :: arr(leng)
    
      read(fileUnit) arr
    END SUBROUTINE
    
    !cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
    SUBROUTINE read_fortran_real4(fileUnit, arr, row, col)
      implicit none
    
      integer*4 :: fileUnit, row, col
      real*4 :: arr(row,col)
      !f2py integer*4 intent(in):: fileUnit, row, col
      !f2py real*4 intent(out), dimension(row, col), depend(row, col) :: arr(row,col)
    
      read(fileUnit) arr
    END SUBROUTINE
    
    !cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
    SUBROUTINE close_fortran_file(fileUnit, error)
      implicit none
    
      integer*4 :: fileUnit, error
      !f2py integer*4 intent(in) :: fileUnit
      !f2py integer*4 intent(out) :: error
    
      close(fileUnit, iostat=error)
    END SUBROUTINE 
    