How do I properly create custom text codecs?

不知归路 2020-12-15 21:11

I'm digging through some old binaries that contain (among other things) text. Their text frequently uses custom character encodings for Reasons, and I want to be able to re

2 Answers
  •  隐瞒了意图╮
    2020-12-15 21:27

    You asked for minimal!

    • Write an encode function and a decode function.
    • Write a "search function" that returns a CodecInfo object constructed from the above encoder and decoder.
    • Use codecs.register to register that search function.

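    Here is an example that maps the lowercase letters a-z to the numbers 0-25 in order. Because the numbers are written without a separator, the round trip is only unambiguous for a-j (0-9): b'k' decodes to '10', which encodes back to b'ba'.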

    import codecs
    import string
    
    from typing import Tuple
    
    # prepare map from numbers to letters
    _encode_table = {str(number): bytes(letter, 'ascii') for number, letter in enumerate(string.ascii_lowercase)}
    
    # prepare inverse map
    _decode_table = {ord(v): k for k, v in _encode_table.items()}
    
    
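    def custom_encode(text: str, errors: str = 'strict') -> Tuple[bytes, int]:
        # example encoder that converts digit characters to letters
        # `errors` is accepted so callers may pass it, but this minimal example ignores it
        # see https://docs.python.org/3/library/codecs.html#codecs.Codec.encode
        return b''.join(_encode_table[x] for x in text), len(text)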
    
    
    def custom_decode(binary: bytes, errors: str = 'strict') -> Tuple[str, int]:
        # example decoder that converts letters to digit strings
        # `errors` is accepted so callers may pass it, but this minimal example ignores it
        # see https://docs.python.org/3/library/codecs.html#codecs.Codec.decode
        return ''.join(_decode_table[x] for x in binary), len(binary)
    
    
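    def custom_search_function(encoding_name):
        # the registry passes the requested name in lowercase;
        # return None for any name this function does not handle,
        # otherwise every unknown encoding would resolve to this codec
        if encoding_name != 'reasons':
            return None
        return codecs.CodecInfo(custom_encode, custom_decode, name='Reasons')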
    
    
    def main():
    
        # register your custom codec
        # the name 'Reasons' used below is lowercased to 'reasons' before it reaches the search function
        codecs.register(custom_search_function)
    
        binary = b'abcdefg'
        # decode letters to numbers
        text = codecs.decode(binary, encoding='Reasons')
        print(text)
        # encode numbers to letters
        binary2 = codecs.encode(text, encoding='Reasons')
        print(binary2)
        # encode(decode(...)) should be an identity function
        assert binary == binary2
    
    if __name__ == '__main__':
        main()
    

    Running this prints

    $ python codec_example.py
    0123456
    b'abcdefg'
    

    See https://docs.python.org/3/library/codecs.html#codec-objects for details on the Codec interface. In particular, the decode function

    ... decodes the object input and returns a tuple (output object, length consumed).

    whereas the encode function

    ... encodes the object input and returns a tuple (output object, length consumed).

    Note that you should also worry about handling streams, incremental encoding/decoding, as well as error handling. For a more complete example, refer to the hexlify codec that @krs013 mentioned.
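As a rough illustration of those extra pieces, here is a sketch that adds `errors` handling plus incremental and stream classes. It is not taken from the hexlify codec; the name `reasons10`, the class names, and the reduced table (digits 0-9 mapped to a-j, so the round trip stays unambiguous) are all assumptions made for this example. The mapping is stateless per byte, so the incremental classes can simply delegate to the plain codec.

```python
import codecs
import io
import string

# Hypothetical reduced table: '0'-'9' <-> b'a'-b'j' (one symbol per byte).
_ENCODE = {str(n): string.ascii_lowercase[n].encode('ascii') for n in range(10)}
_DECODE = {v[0]: k for k, v in _ENCODE.items()}


class ReasonsCodec(codecs.Codec):
    def encode(self, text, errors='strict'):
        out = []
        for pos, ch in enumerate(text):
            if ch in _ENCODE:
                out.append(_ENCODE[ch])
            elif errors == 'ignore':
                continue
            elif errors == 'replace':
                out.append(b'?')
            else:  # treat every other handler as 'strict' in this sketch
                raise UnicodeEncodeError('reasons10', text, pos, pos + 1,
                                         'character not in table')
        return b''.join(out), len(text)

    def decode(self, data, errors='strict'):
        data = bytes(data)  # bytes.decode() may hand over a memoryview
        out = []
        for pos, byte in enumerate(data):
            if byte in _DECODE:
                out.append(_DECODE[byte])
            elif errors == 'ignore':
                continue
            elif errors == 'replace':
                out.append('\ufffd')
            else:  # treat every other handler as 'strict' in this sketch
                raise UnicodeDecodeError('reasons10', data, pos, pos + 1,
                                         'byte not in table')
        return ''.join(out), len(data)


_codec = ReasonsCodec()


class ReasonsIncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, text, final=False):
        return _codec.encode(text, self.errors)[0]


class ReasonsIncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, data, final=False):
        return _codec.decode(data, self.errors)[0]


class ReasonsStreamWriter(ReasonsCodec, codecs.StreamWriter):
    pass


class ReasonsStreamReader(ReasonsCodec, codecs.StreamReader):
    pass


def reasons_search(name):
    if name != 'reasons10':
        return None
    return codecs.CodecInfo(
        name='reasons10',
        encode=_codec.encode,
        decode=_codec.decode,
        incrementalencoder=ReasonsIncrementalEncoder,
        incrementaldecoder=ReasonsIncrementalDecoder,
        streamreader=ReasonsStreamReader,
        streamwriter=ReasonsStreamWriter,
    )


codecs.register(reasons_search)

encoded = '0123456'.encode('reasons10')
replaced = b'abc!'.decode('reasons10', errors='replace')
decoder = codecs.getincrementaldecoder('reasons10')()
chunks = decoder.decode(b'abc') + decoder.decode(b'defg', final=True)
streamed = codecs.getreader('reasons10')(io.BytesIO(b'abcdefg')).read()
```

A codec whose symbols span several bytes would need real buffering in the incremental decoder; this one gets away without it only because each byte decodes independently.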


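    P.S. instead of codecs.decode, you can also use codecs.open(..., encoding='Reasons'). Note that codecs.open needs the CodecInfo to provide streamreader and streamwriter as well, which the minimal example above does not.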
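For the file-based route, a minimal sketch might look like the following. The codec name `reasonsdemo`, the class names, and the a-j/0-9 table are assumptions made for illustration, and error handling is left out (unknown symbols raise KeyError). The point is only that `codecs.open` instantiates the `streamreader` and `streamwriter` classes from the CodecInfo.

```python
import codecs
import os
import tempfile

# Hypothetical table: '0'-'9' <-> byte values of 'a'-'j'.
_TO_BYTE = {str(n): ord('a') + n for n in range(10)}
_TO_CHAR = {b: c for c, b in _TO_BYTE.items()}


class DemoCodec(codecs.Codec):
    # No error handling in this sketch: unknown symbols raise KeyError.
    def encode(self, text, errors='strict'):
        return bytes(_TO_BYTE[ch] for ch in text), len(text)

    def decode(self, data, errors='strict'):
        data = bytes(data)
        return ''.join(_TO_CHAR[b] for b in data), len(data)


class DemoStreamWriter(DemoCodec, codecs.StreamWriter):
    pass


class DemoStreamReader(DemoCodec, codecs.StreamReader):
    pass


def demo_search(name):
    if name != 'reasonsdemo':
        return None
    codec = DemoCodec()
    return codecs.CodecInfo(codec.encode, codec.decode,
                            streamreader=DemoStreamReader,
                            streamwriter=DemoStreamWriter,
                            name='reasonsdemo')


codecs.register(demo_search)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.bin')
    # Text goes in, custom-encoded bytes land on disk.
    with codecs.open(path, 'w', encoding='reasonsdemo') as f:
        f.write('0123456')
    with open(path, 'rb') as f:
        raw = f.read()
    # Reading through the codec turns the bytes back into text.
    with codecs.open(path, 'r', encoding='reasonsdemo') as f:
        text = f.read()
```

Here `raw` holds the letters written to disk and `text` holds the digits read back through the codec.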
