In Python, how do I read in a binary file and loop over each byte of that file?
To read a file one byte at a time (ignoring buffering), you can use the two-argument form of the built-in iter(callable, sentinel) function:
with open(filename, 'rb') as file:
    for byte in iter(lambda: file.read(1), b''):
        # Do stuff with byte
It calls file.read(1) repeatedly until it returns b'' (an empty bytestring), which signals the end of the file. Memory use stays bounded even for large files. You can pass buffering=0 to open() to disable buffering, which guarantees that only one byte is read from the file per iteration (slow).
The with statement closes the file automatically, including when the code inside the block raises an exception.
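As a self-contained sketch of the loop above, the following wraps it in a generator and runs it on a small temporary file. The helper name iter_single_bytes and the sample data are assumptions made for illustration; they are not part of the original answer.

```python
import os
import tempfile


def iter_single_bytes(path):
    """Yield the file's contents as length-1 bytes objects (hypothetical helper)."""
    with open(path, 'rb') as file:
        # iter(callable, sentinel) keeps calling file.read(1) until it returns b''.
        for byte in iter(lambda: file.read(1), b''):
            yield byte


# Sample data chosen for illustration only.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\x00\x01abc')
    sample_path = tmp.name

try:
    # list() exhausts the generator, so the file is closed before it is removed.
    result = list(iter_single_bytes(sample_path))
finally:
    os.remove(sample_path)

print(result)  # [b'\x00', b'\x01', b'a', b'b', b'c']
```

Note that each item is a bytes object of length 1, not an int; use byte[0] or ord(byte) if you need the numeric value.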
Even with the default internal buffering, it is still inefficient to process one byte at a time. For example, here's the blackhole.py utility that eats everything it is given:
#!/usr/bin/env python3
"""Discard all input. `cat > /dev/null` analog."""
import sys
from functools import partial
from collections import deque
chunksize = int(sys.argv[1]) if len(sys.argv) > 1 else (1 << 15)
deque(iter(partial(sys.stdin.detach().read, chunksize), b''), maxlen=0)
Example:
$ dd if=/dev/zero bs=1M count=1000 | python3 blackhole.py
On my machine, it processes ~1.5 GB/s when chunksize == 32768 and only ~7.5 MB/s when chunksize == 1. That is, reading one byte at a time is 200 times slower. Keep this in mind if you need performance and can rewrite your processing to handle more than one byte at a time.
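If the processing itself still has to look at individual bytes, one possible middle ground is to read in chunks and loop over each chunk in memory. The sketch below assumes a hypothetical helper named iter_byte_values and reuses the 1 << 15 chunk size from blackhole.py as its default. It only reduces the number of read() calls; the per-byte Python loop remains, so the throughput figures above do not apply to it directly.

```python
import os
import tempfile


def iter_byte_values(path, chunksize=1 << 15):
    """Yield each byte of the file as an int, reading chunksize bytes per call."""
    with open(path, 'rb') as file:
        for chunk in iter(lambda: file.read(chunksize), b''):
            # Iterating over a bytes object yields ints in the range 0..255.
            yield from chunk


# Sample data for illustration: large enough to span several chunks.
data = bytes(range(256)) * 400
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    sample_path = tmp.name

try:
    values = list(iter_byte_values(sample_path))
finally:
    os.remove(sample_path)

print(len(values) == len(data), values[:4])
```

Unlike the file.read(1) loop, this yields ints rather than length-1 bytes objects, because that is what iterating over a bytes chunk produces in Python 3.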
mmap allows you to treat a file as a bytearray and a file object simultaneously. It can serve as an alternative to loading the whole file into memory if you need access to both interfaces. In particular, you can iterate over a memory-mapped file one byte at a time using a plain for loop:
from mmap import ACCESS_READ, mmap
with open(filename, 'rb', 0) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
    for byte in mm:  # length is equal to the current file size
        # Do stuff with byte
mmap supports slice notation. For example, mm[i:i+n] returns n bytes from the file starting at position i. The context manager protocol is not supported before Python 3.2; in that case you need to call mm.close() explicitly. Iterating over each byte using mmap consumes more memory than file.read(1), but mmap is an order of magnitude faster.
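To illustrate the slice notation, the file-like interface, and the explicit close() call together, here is a minimal sketch on a temporary file. The sample contents and the names i, n, middle, and tail are assumptions chosen for the example.

```python
import os
import tempfile
from mmap import ACCESS_READ, mmap

# Sample data chosen for illustration only.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'0123456789')
    sample_path = tmp.name

try:
    with open(sample_path, 'rb', 0) as f:
        mm = mmap(f.fileno(), 0, access=ACCESS_READ)
        try:
            i, n = 2, 4
            middle = mm[i:i + n]  # bytearray-like: n bytes starting at position i
            mm.seek(7)
            tail = mm.read(3)     # file-like: read 3 bytes from the current position
            size = len(mm)        # equals the file size at mapping time
        finally:
            mm.close()            # explicit close, as needed without the with statement
finally:
    os.remove(sample_path)

print(middle, tail, size)  # b'2345' b'789' 10
```

The try/finally around the mapping plays the role the with statement plays in the example above, so the mapping is released even if the processing raises.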