Reading binary and text from same file in Python

Posted by 一笑奈何 on 2019-12-24 05:06:13

Question


How does one read binary and text from the same file in Python? I know how to do each separately, and can imagine doing both very carefully, but not both with the built-in IO library directly.

I have a file whose format has large chunks of UTF-8 text interspersed with binary data. The text is not preceded by its length, nor is it terminated by a special character such as "\0" that would separate it from the binary data. Instead, there is a large portion of text near the end which, when parsed, means "we are coming to an end".

The optimal solution would be for the built-in file-reading classes to have both "read(n)" and "read_char(n)" methods, but alas they don't. I can't even open the file twice, once as text and once as binary, since the return value of tell() on the text one can't be used with the binary one in any meaningful way.

So my first idea would be to open the whole file as binary and, when I reach a chunk of text, read it "character by character" until I realize that the text is ending, then go back to reading binary. However, this means that I have to read byte by byte and do my own decoding of UTF-8 characters (do I need to read another byte for this character before doing something with it?). If it were a fixed-width character encoding, I would just read that many bytes each time. In the end I would also like universal line endings as supported by the Python text readers, but that would be even more difficult to implement while reading byte by byte.

Another, easier solution would be if I could ask the text file object for its real offset in the file. That alone would solve all my problems.


Answer 1:


One way might be to use Hachoir to define a file parsing protocol.

The simple alternative is to open the file in binary mode and manually initialise a buffer and text wrapper around it. You can then switch in and out of binary pretty neatly:

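import io

my_file = io.open("myfile.txt", "rb")
my_file_buffer = io.BufferedReader(my_file, buffer_size=1) # Not as performant but a larger buffer will "eat" into the binary data
# Caveat: TextIOWrapper may also read ahead in chunks of its own, so check
# that the binary read below starts at the offset you expect
my_file_text_reader = io.TextIOWrapper(my_file_buffer, encoding="utf-8")
string_buffer = ""

while True:
    while "near the end" not in string_buffer:
        char = my_file_text_reader.read(1) # read one Unicode char at a time
        if not char:
            break # end of file: stop instead of looping forever
        string_buffer += char
    if not string_buffer:
        break # nothing left to read

    # binary data must be next. Where do we get the binary length from?
    print(string_buffer)
    data = my_file_buffer.read(3)

    print(data)
    string_buffer = ""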
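
A variation on the same idea keeps a single binary file object and decodes the text with an incremental decoder from the standard codecs module, so nothing is read beyond the bytes actually consumed. This is only a sketch under assumptions: the helper name read_text_until, the sample bytes and the 3-byte binary length are all made up for illustration.

```python
import codecs
import io


def read_text_until(f, marker, encoding="utf-8"):
    """Decode text from binary file f until the text ends with marker.

    Hypothetical helper: reads one byte at a time, so f.tell() afterwards
    is the offset of the first byte after the marker.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    text = ""
    while not text.endswith(marker):
        byte = f.read(1)
        if not byte:
            raise EOFError("marker %r not found before end of file" % marker)
        # decode() returns "" until a multibyte character is complete
        text += decoder.decode(byte)
    return text


# Made-up sample: text, 3 binary bytes, text, 3 binary bytes
sample = (
    "h\u00e9llo near the end".encode("utf-8") + b"\x00\xff\x10"
    + "\u65e5\u672c\u8a9e near the end".encode("utf-8") + b"\x01\x02\x03"
)
f = io.BytesIO(sample)

first_text = read_text_until(f, "near the end")
first_blob = f.read(3)   # plain binary read on the same file object
second_text = read_text_until(f, "near the end")
second_blob = f.read(3)
```

Because the decoder only ever sees bytes that belong to the text, the binary reads start exactly where the text stops. The cost is one read call per byte, which is slow on large text chunks.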

A quicker, less extensible way might be to use the approach you've suggested in your question: parse the text portions yourself, reading one UTF-8 byte sequence at a time. The following code (from http://rosettacode.org/wiki/Read_a_file_character_by_character/UTF8#Python) seems to be a neat way to conservatively read UTF-8 bytes into characters from a binary file:

def get_next_character(f):
    # note: assumes valid utf-8
    c = f.read(1)
    while c:
        while True:
            try:
                yield c.decode('utf-8')
            except UnicodeDecodeError:
                # we've encountered a multibyte character
                # read another byte and try again
                next_byte = f.read(1)
                if not next_byte:
                    # file ended in the middle of a character:
                    # stop instead of retrying forever
                    return
                c += next_byte
            else:
                # c was a valid char, and was yielded, continue
                c = f.read(1)
                break

# Usage:
with open("input.txt", "rb") as f:
    my_unicode_str = ""
    for c in get_next_character(f):
        my_unicode_str += c
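
For the universal line endings mentioned in the question, the standard library's io.IncrementalNewlineDecoder can be wrapped around an incremental UTF-8 decoder while still reading byte by byte. The following is a rough sketch rather than a tested solution for your format; the function name iter_text_universal_newlines and the sample bytes are assumptions made for illustration.

```python
import codecs
import io


def iter_text_universal_newlines(f, encoding="utf-8"):
    """Yield decoded text from binary file f with "\r\n" and "\r" turned into "\n".

    Hypothetical helper. Each yielded string is usually one character, but
    may be two when a held-back "\r" is released together with the next one.
    """
    decoder = io.IncrementalNewlineDecoder(
        codecs.getincrementaldecoder(encoding)(), translate=True
    )
    while True:
        byte = f.read(1)
        if not byte:
            # final=True releases a pending "\r" and raises
            # UnicodeDecodeError if a character was cut short
            rest = decoder.decode(b"", final=True)
            if rest:
                yield rest
            return
        text = decoder.decode(byte)
        if text:
            yield text


# Made-up sample mixing "\r\n", "\r" and "\n" line endings
sample = io.BytesIO(b"one\r\ntwo\rthree\n\xc3\xa9\r")
result = "".join(iter_text_universal_newlines(sample))
```

One thing to watch when switching back to binary: the decoder holds a trailing "\r" until it sees the next byte, so at a text-to-binary boundary it should be flushed with decode(b"", final=True) before the binary read.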


Source: https://stackoverflow.com/questions/32659104/reading-binary-and-text-from-same-file-in-python
