I'm trying to read a very big Fortran unformatted binary file with Python. This file contains 2^30 integers.
I find that the record markers are confusing (the first
Since this question seems to come up often, here is a Python utility to scan a binary file and determine whether it is (or might be) a Fortran unformatted sequential access file. It works by trying several header formats. Of course, since the "unformatted" format isn't standard, there could be other variants, but this should hit the most common ones.
Note: if the left angle brackets show up escaped as &#060; when you copy this from the screen, change them back to a 'less than' sign (<).
import struct

def scanfbinary(hformat, f, fsize):
    """ scan a file to see if it has the simple structure typical of
    an unformatted sequential access fortran binary:
    recl1,<data of length recl1 bytes>,recl1,recl2,<data of length recl2 bytes>,recl2 ...
    """
    print('scan type', hformat, end=' ')
    # header size follows from the struct type code
    if hformat[1] in 'qQ':
        hsize = 8
    elif hformat[1] in 'iIlL':
        hsize = 4
    if hformat[0] == '<':
        endian = 'little'
    elif hformat[0] == '>':
        endian = 'big'
    print('(', endian, 'endian', hsize, 'byte header)', end=' ')
    f.seek(0)
    nrec = 0
    while fsize > 0:
        h0 = struct.unpack(hformat, f.read(hsize))[0]
        if h0 < 0:
            print('invalid integer', h0)
            return 1
        if h0 > fsize - 2*hsize:
            print('invalid header size', h0, 'exceeds file size', fsize)
            if nrec > 0:
                print('odd, perhaps a corrupt file?')
            return 2
        # to read the data, replace the seek below with code to read h0 bytes,
        # e.g.
        #   import numpy
        #   dtype = numpy.dtype('<i')
        #   record = numpy.fromfile(f, dtype, h0 // dtype.itemsize)
        f.seek(h0, 1)
        h = struct.unpack(hformat, f.read(hsize))[0]
        if h0 != h:
            print('unmatched header')
            return 3
        nrec += 1
        if nrec == 1:
            print()
        if nrec < 10:
            print('read record', nrec, 'size', h)
        fsize -= (h + 2*hsize)
    print('successfully read', nrec, 'records with unformatted fortran header type', hformat)
    return 0

f = open('binaryfilename', 'rb')  # binary mode, so the bytes are read unchanged
f.seek(0, 2)
fsize = f.tell()
res = [scanfbinary(hformat, f, fsize) for hformat in ('<q', '>q', '<i', '>i')]
if res.count(0) == 0:
    print('no match found, file size', fsize, 'starts..')
    f.seek(0)
    for i in range(0, 12):
        print(f.read(2).hex(), end=' ')
    print()
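To try the record layout described in the docstring without a real Fortran file, something like the following could be used. It is only a sketch: the helper names write_records and read_records are made up for this example, and it assumes the simple case of one leading and one trailing length marker per record, with no subrecords.

```python
import struct

def write_records(path, payloads, hformat="<i"):
    """Write each payload framed by a leading and a trailing length marker."""
    with open(path, "wb") as out:
        for payload in payloads:
            marker = struct.pack(hformat, len(payload))
            out.write(marker + payload + marker)

def read_records(path, hformat="<i"):
    """Return the payload of every record; raise ValueError if the framing is inconsistent."""
    hsize = struct.calcsize(hformat)
    records = []
    with open(path, "rb") as f:
        while True:
            raw = f.read(hsize)
            if not raw:
                break  # clean end of file
            if len(raw) < hsize:
                raise ValueError("truncated leading marker")
            (h0,) = struct.unpack(hformat, raw)
            if h0 < 0:
                raise ValueError("negative record length %d" % h0)
            payload = f.read(h0)
            tail = f.read(hsize)
            if len(payload) != h0 or len(tail) != hsize:
                raise ValueError("record runs past the end of the file")
            (h,) = struct.unpack(hformat, tail)
            if h != h0:
                raise ValueError("unmatched header")
            records.append(payload)
    return records
```

For example, write_records('demo.bin', [b'abcd', b'0123456789']) produces a 30 byte file that read_records('demo.bin') turns back into the two payloads, while reading it with the wrong byte order (hformat='>i') raises ValueError.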
Finally, things seem clearer.
The Intel Fortran Compiler User and Reference Guides explain this in the section Record Types: Variable-Length Records.
For a record length greater than 2,147,483,639 bytes, the record is divided into subrecords. The subrecord can be of any length from 1 to 2,147,483,639, inclusive.
The sign bit of the leading length field indicates whether the record is continued or not. The sign bit of the trailing length field indicates the presence of a preceding subrecord. The position of the sign bit is determined by the endian format of the file.
A subrecord that is continued has a leading length field with a sign bit value of 1. The last subrecord that makes up a record has a leading length field with a sign bit value of 0. A subrecord that has a preceding subrecord has a trailing length field with a sign bit value of 1. The first subrecord that makes up a record has a trailing length field with a sign bit value of 0. If the value of the sign bit is 1, the length of the record is stored in twos-complement notation.
After many attempts, I realized that I had been misled by the twos-complement notation: the record marker just changes its sign according to the rules above, instead of changing to its twos-complement notation when the sign bit is 1. It's also possible that my data was created with a different compiler.
Below is the solution.
The data is larger than 2 GB, so it is divided into several subrecords. As we see, the start marker of the first record is -2147483639, so the length of the first record is 2147483639, which is exactly the maximum subrecord length, not 2147483640 as I thought, nor 2147483638, the twos-complement notation of -2147483639.
If we skip 2147483639 bytes and read the record end marker, we get 2147483639, because this is the first subrecord and its end marker is therefore positive.
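The split can be checked with a little arithmetic. This is only a sketch, assuming the 2^30 integers are 4 bytes each and were written as a single record:

```python
MAX_SUBRECORD = 2147483639  # maximum subrecord length from the Intel guide quoted above
BYTES_PER_INTEGER = 4       # assumption: default 4-byte integers

total_bytes = 2**30 * BYTES_PER_INTEGER
full_subrecords, last_subrecord = divmod(total_bytes, MAX_SUBRECORD)

print(total_bytes)                        # 4294967296
print(full_subrecords, last_subrecord)    # 2 18
print(MAX_SUBRECORD % BYTES_PER_INTEGER)  # 3
```

Under those assumptions the data needs two full subrecords plus a final one of 18 bytes, and a full subrecord ends 3 bytes into an integer.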
Below is the code to check the record markers:
import struct

fp = open(file_path, "rb")
while True:
    prefix, = struct.unpack('>i', fp.read(4))
    fp.seek(abs(prefix), 1)  # or read abs(prefix) bytes of data as you want
    suffix, = struct.unpack('>i', fp.read(4))
    print(prefix, suffix)
    if abs(suffix) != abs(prefix):
        print("suffix != prefix!")
        break
    if prefix > 0:  # a positive begin marker means this was the last subrecord
        break
And the screen prints:
-2147483639 2147483639
-2147483639 -2147483639
18 -18
We can see that the begin marker and end marker of a record are always the same except for the sign. The lengths of the three records are 2147483639, 2147483639, and 18 bytes, which need not be multiples of 4. So the first record ends with the first 3 bytes of some integer, and the second record begins with the remaining 1 byte.
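Because a value can straddle two subrecords, the payloads have to be joined before they are decoded. Here is a minimal sketch of that, assuming the sign convention observed above (a negative begin marker means another subrecord follows) and 4-byte big-endian markers. The function name read_logical_record is invented for this example.

```python
import struct

def read_logical_record(fp, marker_format=">i"):
    """Read one logical record from fp, joining its subrecords into one bytes object."""
    marker_size = struct.calcsize(marker_format)
    chunks = []
    while True:
        raw = fp.read(marker_size)
        if len(raw) < marker_size:
            raise EOFError("no complete record marker left")
        (prefix,) = struct.unpack(marker_format, raw)
        length = abs(prefix)
        data = fp.read(length)
        tail = fp.read(marker_size)
        if len(data) < length or len(tail) < marker_size:
            raise EOFError("file ends inside a subrecord")
        (suffix,) = struct.unpack(marker_format, tail)
        if abs(suffix) != length:
            raise ValueError("suffix != prefix")
        chunks.append(data)
        if prefix >= 0:  # non-negative begin marker: this was the last subrecord
            break
    return b"".join(chunks)
```

The joined payload can then be decoded in one go, for example with struct.unpack('>%di' % (len(payload) // 4), payload) for a small record, or with numpy.frombuffer(payload, dtype='>i4') for a large one. Note that this holds the whole record in memory, which for the 4 GB file above may not be practical.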
I found that using f2py is a more convenient way for Python to access Fortran data, although the strange behavior of the record markers remains a question. At least we can avoid diving into the (sometimes confusing) Fortran unformatted file structure, and it works well with numpy.
The F2PY Users Guide and Reference Manual is here. Here's an example Fortran source file for opening and closing a file, and for reading a 1-D integer array and a 2-D float array. Note the comments beginning with !f2py; they help make f2py more 'clever'.
To use it, you need to wrap it into a module and import it into a Python session. Then you can call these subroutines just like Python functions.
!ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
!cc cc
!cc FORTRAN MODULE for PYTHON PROGRAM CALLING cc
!cc cc
!ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
!Usage:
! Compile: f2py -c fortio.f90 -m fortio
! Import: from fortio import *
! or import fortio
!Note:
! Big endian: 1; Little endian: 0
!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
SUBROUTINE open_fortran_file(fileUnit, fileName, endian, error)
    implicit none
    character(len=256) :: fileName
    integer*4 :: fileUnit, error, endian
    !f2py integer*4 optional, intent(in) :: endian=1
    !f2py integer*4 intent(out) :: error
    if(endian .NE. 0) then
        open(unit=fileUnit, FILE=fileName, form='unformatted', status='old', &
             iostat=error, convert='big_endian')
    else
        open(unit=fileUnit, FILE=fileName, form='unformatted', status='old', &
             iostat=error)
    endif
END SUBROUTINE
!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
SUBROUTINE read_fortran_integer4(fileUnit, arr, leng)
    implicit none
    integer*4 :: fileUnit, leng
    integer*4 :: arr(leng)
    !f2py integer*4 intent(in) :: fileUnit, leng
    !f2py integer*4 intent(out), dimension(leng), depend(leng) :: arr(leng)
    read(fileUnit) arr
END SUBROUTINE
!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
SUBROUTINE read_fortran_real4(fileUnit, arr, row, col)
    implicit none
    integer*4 :: fileUnit, row, col
    real*4 :: arr(row,col)
    !f2py integer*4 intent(in):: fileUnit, row, col
    !f2py real*4 intent(out), dimension(row, col), depend(row, col) :: arr(row,col)
    read(fileUnit) arr
END SUBROUTINE
!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
SUBROUTINE close_fortran_file(fileUnit, error)
    implicit none
    integer*4 :: fileUnit, error
    !f2py integer*4 intent(in) :: fileUnit
    !f2py integer*4 intent(out) :: error
    close(fileUnit, iostat=error)
END SUBROUTINE
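On the Python side, calling the wrapped routines might look roughly like this. It is a sketch, not tested against a real build: it assumes the module was compiled as fortio with the command from the header comments, and that f2py turns the intent(out) arguments into return values, as the annotations suggest. The helper name read_integers and the unit number 10 are arbitrary choices for the example. The signatures f2py actually generated can be checked with print(fortio.read_fortran_integer4.__doc__).

```python
def read_integers(fortio, file_name, count, unit=10, big_endian=True):
    """Open file_name through the wrapped routines, read `count` integers, close the unit.

    `fortio` is the imported f2py module, passed in so that the helper does not
    depend on how or where the module was built.
    """
    # assumed signature: error = open_fortran_file(fileunit, filename, [endian])
    error = fortio.open_fortran_file(unit, file_name, 1 if big_endian else 0)
    if error != 0:
        raise IOError("open failed with iostat %d" % error)
    try:
        # assumed signature: arr = read_fortran_integer4(fileunit, leng)
        return fortio.read_fortran_integer4(unit, count)
    finally:
        # assumed signature: error = close_fortran_file(fileunit)
        fortio.close_fortran_file(unit)
```

With the compiled module available, this would be used as import fortio followed by data = read_integers(fortio, 'binaryfilename', 2**30). The Fortran read statement then deals with the record markers itself.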