How to read part of binary file with numpy?

春和景丽 2020-12-18 21:24

I'm converting a MATLAB script to NumPy, but I'm having problems reading data from a binary file. Is there an equivalent to fseek when using fromfile?

4 answers
  •  遥遥无期
    2020-12-18 21:58

    There probably is a better answer… But when I've been faced with this problem, I had a file that I already wanted to access different parts of separately, which gave me an easy solution to this problem.

    For example, say chunkyfoo.bin is a file consisting of a 6-byte header, a 1024-byte numpy array, and another 1024-byte numpy array. You can't just open the file and seek 6 bytes (because, at least in some NumPy versions, the first thing numpy.fromfile does is lseek back to 0). But you can just mmap the file and use frombuffer instead (fromstring used to work here too, but its binary mode is deprecated):

    import mmap
    from contextlib import closing
    import numpy as np

    with open('chunkyfoo.bin', 'rb') as f:
        with closing(mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)) as m:
            a1 = np.frombuffer(m[6:1030])  # slicing copies the bytes; dtype defaults to float64
            a2 = np.frombuffer(m[1030:])
    

    This sounds like exactly what you want to do. Except, of course, that in real life the offsets and lengths of a1 and a2 probably depend on the header, rather than being fixed constants.
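For the closest thing to fseek itself: numpy.fromfile accepts an offset argument (in bytes) in NumPy 1.17 and later, together with count (in items). The sketch below assumes such a NumPy version and writes its own throwaway chunkyfoo.bin with made-up contents, so the header bytes and array values are purely illustrative.

```python
import os
import tempfile

import numpy as np

# Illustrative data: a 6-byte header followed by two 1024-byte float64 arrays.
header = b"HEADER"
first = np.arange(128, dtype=np.float64)
second = np.arange(128, 256, dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunkyfoo.bin")
    with open(path, "wb") as f:
        f.write(header + first.tobytes() + second.tobytes())

    # offset is in bytes from the start of the file, count is in items.
    a1 = np.fromfile(path, dtype=np.float64, count=128, offset=6)
    a2 = np.fromfile(path, dtype=np.float64, count=128, offset=6 + 1024)
```

This reads only the requested range, so nothing has to be mapped or sliced first.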

    The header is just m[:6], and you can parse that by explicitly pulling it apart, using the struct module, or whatever else you'd do once you read the data. But, if you'd prefer, you can explicitly seek and read from f before constructing m, or after, or even make the same calls on m, and it will work, without affecting a1 and a2.
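As a rough sketch of the struct approach: the header layout below (a little-endian uint16 version followed by a uint32 element count, 6 bytes in total) is a hypothetical one invented for illustration, and blob is an in-memory stand-in for the mmap m. Your real header will have its own layout.

```python
import struct

import numpy as np

# Hypothetical header layout: uint16 version, uint32 element count of the first array.
HEADER_FORMAT = "<HI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 6 bytes, no padding with "<"

# Build some sample bytes standing in for the mapped file contents.
first = np.arange(3, dtype=np.float64)
second = np.arange(10, 15, dtype=np.float64)
blob = struct.pack(HEADER_FORMAT, 1, first.size) + first.tobytes() + second.tobytes()

# Parse the header, then derive where the first array ends from it.
version, n_first = struct.unpack(HEADER_FORMAT, blob[:HEADER_SIZE])
split = HEADER_SIZE + n_first * np.dtype(np.float64).itemsize

a1 = np.frombuffer(blob[HEADER_SIZE:split], dtype=np.float64)
a2 = np.frombuffer(blob[split:], dtype=np.float64)
```

The same slicing works on an mmap object, since m[a:b] returns bytes just like slicing blob does.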

    An alternative, which I've done for a different non-numpy-related project, is to create a wrapper file object, like this:

    class SeekedFileWrapper(object):
        def __init__(self, fileobj):
            self.fileobj = fileobj
            self.offset = fileobj.tell()
        def seek(self, offset, whence=0):
            if whence == 0:
                offset += self.offset
            return self.fileobj.seek(offset, whence)
        # ... delegate everything else unchanged
    

    I did the "delegate everything else unchanged" by generating a list of attributes at construction time and using that in __getattr__, but you probably want something less hacky. numpy only relies on a handful of methods of the file-like object, and I think they're properly documented, so just explicitly delegate those.

    But I think the mmap solution makes more sense here, unless you're trying to mechanically port over a bunch of explicit seek-based code. (You'd think mmap would also give you the option of leaving it as a numpy.memmap instead of a numpy.ndarray, which lets numpy have more control over/feedback from the paging, etc. But it's actually pretty tricky to get a numpy.memmap and an mmap to work together.)
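Here is a minimal sketch of the wrapper idea with explicit delegation instead of __getattr__. The class name OffsetFile and the choice to delegate only seek, tell and read are assumptions made for illustration. It is exercised against an in-memory io.BytesIO rather than handed to numpy.fromfile, since whether fromfile accepts such a wrapper is not something this sketch verifies.

```python
import io

import numpy as np


class OffsetFile:
    """Hypothetical wrapper: the position at construction time becomes offset 0."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.base = fileobj.tell()

    def seek(self, offset, whence=io.SEEK_SET):
        # Only absolute seeks need shifting; relative and end-based seeks pass through.
        if whence == io.SEEK_SET:
            offset += self.base
        return self.fileobj.seek(offset, whence) - self.base

    def tell(self):
        return self.fileobj.tell() - self.base

    def read(self, size=-1):
        return self.fileobj.read(size)


# Sample data: a 6-byte header followed by four float64 values.
raw = io.BytesIO(b"HEADER" + np.arange(4, dtype=np.float64).tobytes())
raw.seek(6)                # skip the header before wrapping
wrapped = OffsetFile(raw)
wrapped.seek(8)            # 8 bytes past the header, i.e. the second float64
values = np.frombuffer(wrapped.read(), dtype=np.float64)
```

Code that seeks on wrapped never needs to know the header exists, which is the point of the wrapper when porting seek-heavy code mechanically.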
