Mapping of character encodings to maximum bytes per character

Submitted by 谁都会走 on 2019-12-24 03:04:15

Question


I'm looking for a table that maps a given character encoding to the (maximum, in the case of variable length encodings) bytes per character. For fixed-width encodings this is easy enough, though I don't know, in the case of some of the more esoteric encodings, what that width is. For UTF-8 and the like it would also be nice to determine the maximum bytes per character depending on the highest codepoint in a string, but this is less pressing.
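For reference, the worst-case widths of a handful of common codecs are well known and can be written down directly (these are standard facts about the encodings themselves, not derived from any particular implementation; the dict name is my own):

```python
# Worst-case bytes per character for some common encodings.
# Keys use Python codec names; endian-specific UTF variants are listed
# to avoid counting a byte-order mark (see the caveat in answer 1).
MAX_BYTES = {
    'ascii': 1,
    'latin-1': 1,
    'utf-8': 4,       # 4 bytes covers Unicode's full range (U+0000..U+10FFFF)
    'utf-16-le': 4,   # surrogate pair for astral characters
    'utf-32-le': 4,   # fixed width
    'shift_jis': 2,
    'euc_jp': 3,      # 3-byte sequences for JIS X 0212 characters
    'gb18030': 4,     # encodes all of Unicode; astral chars take 4 bytes
    'big5': 2,
}

# Spot checks against the codecs themselves:
print(len('\U0010FFFF'.encode('utf-8')))     # 4
print(len('\U0010FFFF'.encode('utf-16-le'))) # 4
print(len('中'.encode('big5')))              # 2
```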

For some background (which you can ignore if you're not familiar with NumPy): I'm working on a prototype for an ndarray subclass that can, with some transparency, represent arrays of encoded bytes (including plain ASCII) as arrays of unicode strings without converting the entire array to UCS4 at once. The idea is that the underlying dtype is still an S<N> dtype, where <N> is the (maximum) number of bytes per string in the array, but item lookups and string methods decode the strings on the fly using the correct encoding. A very rough prototype can be seen here, though eventually parts of this will likely be implemented in C. The most important thing for my use case is efficient use of memory; repeated decoding and re-encoding of strings is acceptable overhead.

Anyway, because the underlying dtype is in bytes, it does not tell users anything useful about the lengths of strings that can be written to a given encoded text array. So having such a map for arbitrary encodings would be very useful for improving the user interface, if nothing else.
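To make the capacity problem concrete (the numbers here are illustrative, not from the prototype): a fixed 12-byte slot, as in an S12 dtype, holds 12 one-byte ASCII characters but can only guarantee 12 // 4 == 3 characters once you budget for UTF-8's 4-byte worst case:

```python
SLOT_BYTES = 12  # bytes per element, as in a hypothetical S12 array

# 12 ASCII characters fill the slot exactly...
print(len('abcdefghijkl'.encode('ascii')))      # 12

# ...but three 4-byte UTF-8 characters also fill it completely,
# so only 3 characters can be guaranteed in the worst case.
print(len(('\U0001F600' * 3).encode('utf-8')))  # 12
print(SLOT_BYTES // 4)                          # 3
```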

Note: I found an answer to basically the same question, specific to Java, here: How can I programatically determine the maximum size in bytes of a character in a specific charset? However, I haven't been able to find any equivalent in Python, nor a useful database of information from which I might implement my own.


Answer 1:


The brute-force approach: iterate over all possible Unicode code points and track the greatest number of bytes used.

def max_bytes_per_char(encoding):
    """Return the maximum number of bytes a single character occupies
    in the given encoding, by brute force over all Unicode code points."""
    max_bytes = 0
    for codepoint in range(0x110000):
        try:
            encoded = chr(codepoint).encode(encoding)
            max_bytes = max(max_bytes, len(encoded))
        except UnicodeError:
            pass  # code point not representable in this encoding
    return max_bytes


>>> max_bytes_per_char('UTF-8')
4
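One caveat worth noting (my addition, not part of the original answer): Python's plain 'utf-16' and 'utf-32' codecs prepend a byte-order mark when encoding, which inflates the per-character count when you encode one character at a time. Use the endian-specific variants to measure the true per-character width:

```python
# The BOM is counted when encoding a single character with 'utf-16':
print(len('a'.encode('utf-16')))              # 4: 2-byte BOM + 2-byte code unit
print(len('a'.encode('utf-16-le')))           # 2: code unit only

# Astral characters need a surrogate pair, hence 4 bytes in UTF-16:
print(len('\U00010000'.encode('utf-16-le')))  # 4
```

So `max_bytes_per_char('utf-16')` would report 6 (BOM plus surrogate pair) rather than the 4 you probably want.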



Answer 2:


Although I accepted @dan04's answer, I am also adding my own here. It was inspired by @dan04's but goes a little further: it gives the width, in bytes, of every character supported by a given encoding, along with the code point ranges that encode to each width (where a width of 0 means the code point is unsupported):

from collections import defaultdict

def encoding_ranges(encoding):
    """Map each byte width to the code point ranges that encode to it.
    A width of 0 means the code point is unsupported by the encoding."""
    codepoint_ranges = defaultdict(list)
    cur_nbytes = None
    start = 0
    for codepoint in range(0x110000):
        try:
            encoded = chr(codepoint).encode(encoding)
            nbytes = len(encoded)
        except UnicodeError:
            nbytes = 0  # unencodable code point

        # Close out the previous run whenever the width changes.
        if nbytes != cur_nbytes and cur_nbytes is not None:
            if codepoint - start > 2:
                codepoint_ranges[cur_nbytes].append((start, codepoint))
            else:
                # Runs of 1-2 code points are recorded individually.
                codepoint_ranges[cur_nbytes].extend(range(start, codepoint))

            start = codepoint

        cur_nbytes = nbytes

    # Record the final run (codepoint holds the last value, 0x10FFFF).
    codepoint_ranges[cur_nbytes].append((start, codepoint + 1))
    return codepoint_ranges

For example:

>>> encoding_ranges('ascii')
defaultdict(<class 'list'>, {0: [(128, 1114112)], 1: [(0, 128)]})
>>> encoding_ranges('utf8')
defaultdict(<class 'list'>, {0: [(55296, 57344)], 1: [(0, 128)], 2: [(128, 2048)], 3: [(2048, 55296), (57344, 65536)], 4: [(65536, 1114112)]})
>>> encoding_ranges('shift_jis')  # output omitted for brevity
For runs of 2 or fewer code points it records the individual code points rather than a range, which is more useful for patchwork encodings like shift_jis.
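If you only need a summary rather than the full ranges, a cheaper variant (my own sketch, using the same brute-force loop) just counts how many code points map to each width:

```python
from collections import Counter

def width_histogram(encoding):
    """Count how many code points encode to each byte width.
    Width 0 counts the code points the encoding cannot represent."""
    counts = Counter()
    for codepoint in range(0x110000):
        try:
            counts[len(chr(codepoint).encode(encoding))] += 1
        except UnicodeError:
            counts[0] += 1
    return counts

print(width_histogram('ascii'))  # Counter({0: 1113984, 1: 128})
```

The maximum width is then just `max(w for w in width_histogram(enc) if w > 0)`, and the histogram makes it easy to see how much of the Unicode range an encoding actually covers.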



Source: https://stackoverflow.com/questions/30870107/mapping-of-character-encodings-to-maximum-bytes-per-character
