How do I properly create custom text codecs?

不知归路 2020-12-15 21:11

I'm digging through some old binaries that contain (among other things) text. Their text frequently uses custom character encodings for Reasons, and I want to be able to re

2 answers
  • 2020-12-15 21:27

    You asked for minimal!

    • Write an encode function and a decode function.
    • Write a "search function" that returns a CodecInfo object constructed from the above encoder and decoder.
    • Use codecs.register to register that search function.

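    Here is an example that decodes the lowercase letters a-z to the numbers 0-25 in order, and encodes them back. Because the encoder reads one character at a time, only a-j (0-9) survive a round trip: b'k' decodes to '10', which encodes back to b'ba'.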

    import codecs
    import string
    
    from typing import Tuple
    
    # prepare map from numbers to letters
    _encode_table = {str(number): bytes(letter, 'ascii') for number, letter in enumerate(string.ascii_lowercase)}
    
    # prepare inverse map
    _decode_table = {ord(v): k for k, v in _encode_table.items()}
    
    
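    def custom_encode(text: str, errors: str = 'strict') -> Tuple[bytes, int]:
        # example encoder that converts digit characters to letter bytes
        # `errors` is accepted because callers may pass it; this minimal example ignores it
        # see https://docs.python.org/3/library/codecs.html#codecs.Codec.encode
        return b''.join(_encode_table[x] for x in text), len(text)
    
    
    def custom_decode(binary: bytes, errors: str = 'strict') -> Tuple[str, int]:
        # example decoder that converts letter bytes to numbers
        # `errors` is accepted because callers may pass it; this minimal example ignores it
        # see https://docs.python.org/3/library/codecs.html#codecs.Codec.decode
        return ''.join(_decode_table[x] for x in binary), len(binary)
    
    
    def custom_search_function(encoding_name):
        # the lookup machinery passes the encoding name in lowercase;
        # return None for any other name so this function does not claim every lookup
        if encoding_name != 'reasons':
            return None
        return codecs.CodecInfo(custom_encode, custom_decode, name='Reasons')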
    
    
    def main():
    
        # register your custom codec
        # note that lookups are case-insensitive: the search function
        # receives the requested encoding name in lowercase ('reasons')
        codecs.register(custom_search_function)
    
        binary = b'abcdefg'
        # decode letters to numbers
        text = codecs.decode(binary, encoding='Reasons')
        print(text)
        # encode numbers to letters
        binary2 = codecs.encode(text, encoding='Reasons')
        print(binary2)
        # encode(decode(...)) should be an identity function
        assert binary == binary2
    
    if __name__ == '__main__':
        main()
    

    Running this prints

    $ python codec_example.py
    0123456
    b'abcdefg'
    

    See https://docs.python.org/3/library/codecs.html#codec-objects for details on the Codec interface. In particular, the decode function

    ... decodes the object input and returns a tuple (output object, length consumed).

    whereas the encode function

    ... encodes the object input and returns a tuple (output object, length consumed).

    Note that you should also worry about handling streams, incremental encoding/decoding, as well as error handling. For a more complete example, refer to the hexlify codec that @krs013 mentioned.
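
    As a rough illustration of what "more complete" can look like, here is a sketch that fills in every CodecInfo slot for a single-byte encoding. It assumes a hypothetical table (bytes 0-25 are a-z, 26 is a space) and a hypothetical codec name, 'reasons_charmap'. It leans on codecs.charmap_encode and codecs.charmap_decode, the helpers the standard library's own single-byte encodings are built on, so the errors argument is honored instead of ignored.

```python
import codecs
import io
import string

# Hypothetical table: byte values 0-25 stand for 'a'-'z' and 26 for a space.
_DECODE_MAP = dict(enumerate(string.ascii_lowercase + ' '))
_ENCODE_MAP = {ord(char): byte for byte, char in _DECODE_MAP.items()}


class Codec(codecs.Codec):
    # Stateless encode/decode; both return (output, length consumed).
    def encode(self, input, errors='strict'):
        return codecs.charmap_encode(input, errors, _ENCODE_MAP)

    def decode(self, input, errors='strict'):
        return codecs.charmap_decode(input, errors, _DECODE_MAP)


class IncrementalEncoder(codecs.IncrementalEncoder):
    # One byte per character, so no state has to be kept between calls.
    def encode(self, input, final=False):
        return codecs.charmap_encode(input, self.errors, _ENCODE_MAP)[0]


class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        return codecs.charmap_decode(input, self.errors, _DECODE_MAP)[0]


class StreamWriter(Codec, codecs.StreamWriter):
    pass


class StreamReader(Codec, codecs.StreamReader):
    pass


def search(encoding_name):
    # Only answer for our own (hypothetical) name; None means "not mine".
    if encoding_name != 'reasons_charmap':
        return None
    return codecs.CodecInfo(
        name='reasons_charmap',
        encode=Codec().encode,
        decode=Codec().decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamreader=StreamReader,
        streamwriter=StreamWriter,
    )


codecs.register(search)

raw = bytes([7, 4, 11, 11, 14])
print(raw.decode('reasons_charmap'))  # hello

# Undefined bytes raise UnicodeDecodeError under 'strict';
# other error handlers are applied by charmap_decode.
print(bytes([7, 255]).decode('reasons_charmap', errors='ignore'))  # h

# With an incremental decoder in place, text-mode I/O works as well.
with io.TextIOWrapper(io.BytesIO(raw), encoding='reasons_charmap') as stream:
    print(stream.read())  # hello
```

    Since every character is exactly one byte here, the incremental classes can be stateless. An encoding with multi-byte sequences would need to buffer partial input between calls.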


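    P.S. Instead of codecs.decode, you can also use codecs.open(..., encoding='Reasons'), although that requires the CodecInfo to supply stream reader and writer classes as well, which the minimal example above does not.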

  • 2020-12-15 21:36

    While the online documentation is certainly sparse, you can get a lot more information by looking at the source code. The docstrings and comments are quite clear, and the definitions for the parent classes (Codec, IncrementalEncoder, etc.) are ready to be copy/pasted for a start to your codec (be sure to replace the object in each class definition with the name of the class you're inheriting from). It's also worth looking at the example I linked to in the comments for how to assemble/register it.
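
    An alternative to copying the parent class definitions is to subclass them directly. The sketch below is only an illustration under assumptions: it invents a two-byte encoding (each big-endian pair is an index into a-z) and the names decode_pairs and PairsIncrementalDecoder are hypothetical. It uses codecs.BufferedIncrementalDecoder, a helper class defined in the codecs source, which keeps any unconsumed trailing bytes until the next call. For brevity it implements only 'strict' error handling.

```python
import codecs
import string

# Hypothetical encoding: every character is a 2-byte big-endian index into a-z.
_TABLE = string.ascii_lowercase


def decode_pairs(data, errors='strict', final=False):
    """Decode complete 2-byte codes and return (text, bytes consumed).

    Only 'strict' behavior is implemented; `errors` is accepted but ignored.
    """
    usable = len(data) - (len(data) % 2)
    if final and usable != len(data):
        # A dangling byte at the very end of the input can never be completed.
        raise UnicodeDecodeError(
            'pairs', bytes(data), usable, len(data), 'truncated 2-byte code')
    chars = []
    for start in range(0, usable, 2):
        code = int.from_bytes(data[start:start + 2], 'big')
        if code >= len(_TABLE):
            raise UnicodeDecodeError(
                'pairs', bytes(data), start, start + 2, 'code not in table')
        chars.append(_TABLE[code])
    return ''.join(chars), usable


class PairsIncrementalDecoder(codecs.BufferedIncrementalDecoder):
    # The base class prepends its buffer to the new input, calls this hook,
    # and stores whatever was not consumed for the next decode() call.
    def _buffer_decode(self, input, errors, final):
        return decode_pairs(input, errors, final)


decoder = PairsIncrementalDecoder()
first = decoder.decode(b'\x00\x07\x00')       # 'h'; the last byte is buffered
second = decoder.decode(b'\x08', final=True)  # buffered byte + b'\x08' gives 'i'
print(first + second)  # hi
```

    The same class could then be passed as incrementaldecoder when building the CodecInfo object.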

    I've been stuck at the same point as you for a while looking through this, so good luck! If I have time in a few days, I'll see about actually making that implementation and pasting/linking to it here.
