Python truncating international string

假装没事ソ posted on 2019-12-01 17:08:13

You need to cut to a length in bytes, so first .encode('utf-8') your string, and then cut the result at a code point boundary.

In UTF-8, ASCII characters (<= 127) take 1 byte. Bytes with the two most significant bits set (>= 192) start a multi-byte character; the number of leading 1 bits gives the total length of the sequence in bytes. Everything else (128 to 191) is a continuation byte.
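The byte classes described above can be sketched as follows. This is only an illustration, written for Python 3 (where indexing a bytes object yields integers); the helper name `classify_utf8_byte` is made up for this example and does not appear in the answers.

```python
def classify_utf8_byte(byte):
    """Classify one UTF-8 byte value (an int in 0..255)."""
    if byte < 0x80:
        return "ascii"         # 0xxxxxxx: a complete 1-byte character
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx: never starts a character
    # 11xxxxxx: starts a multi-byte sequence. Values 0xF8 and above are
    # not valid UTF-8, but a correct encoder never produces them.
    return "lead"

encoded = "aه".encode("utf-8")  # b'a\xd9\x87'
kinds = [classify_utf8_byte(b) for b in encoded]
print(kinds)  # ['ascii', 'lead', 'continuation']
```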

A problem arises if you cut a multi-byte sequence in the middle; if a character does not fit, it should be dropped completely, starting byte included.
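To see the problem concretely, here is a small Python 3 sketch (the variable names are just for illustration). Slicing the encoded bytes at an arbitrary index can leave a dangling lead byte, which then fails to decode:

```python
data = "هيك".encode("utf-8")  # three Arabic letters, 2 bytes each

naive = data[:3]  # ends in the middle of the second letter
try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

safe = data[:2]  # stops at the previous character boundary instead
print(len(data), naive_ok, safe.decode("utf-8"))
```

The naive slice does not decode, while the slice that stops at the boundary yields the first letter only.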

Here's some working code:

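# Longest prefix first: a 3- or 4-byte lead byte also has the 0xC0 bits set,
# so testing 0xC0 first would report length 2 for every lead byte.
# Valid UTF-8 (RFC 3629) never uses sequences longer than 4 bytes.
LENGTH_BY_PREFIX = [
  (0xF0, 4), # first byte mask, total codepoint length
  (0xE0, 3),
  (0xC0, 2),
]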

def codepoint_length(first_byte):
    if first_byte < 128:
        return 1 # ASCII
    for mask, length in LENGTH_BY_PREFIX:
        if first_byte & mask == mask:
            return length
    assert False, 'Invalid byte %r' % first_byte

def cut_to_bytes_length(unicode_text, byte_limit):
    utf8_bytes = unicode_text.encode('UTF-8')
    cut_index = 0
    while cut_index < len(utf8_bytes):
        step = codepoint_length(ord(utf8_bytes[cut_index]))
        if cut_index + step > byte_limit:
            # can't go a whole codepoint further, time to cut
            return utf8_bytes[:cut_index]
        else:
            cut_index += step
    # length limit is longer than our byte string, so no cutting
    return utf8_bytes

Now test. If .decode() succeeds, we have made a correct cut.

unicode_text = u"هيك بنكون" # note that the literal here is Unicode

print cut_to_bytes_length(unicode_text, 100).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 10).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 5).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 4).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 3).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 2).decode('UTF-8')

# This prints an empty string, because an Arabic letter
# requires at least 2 bytes to represent in UTF-8.
print cut_to_bytes_length(unicode_text, 1).decode('UTF-8')

You can test that the code works with ASCII as well.

If you have a Python unicode value and want to truncate it, the following is a very short, general, and efficient way to do it.

def truncate_unicode_to_byte_limit(src, byte_limit, encoding='utf-8'):
    '''
    truncate a unicode value to fit within byte_limit when encoded in encoding

    src: a unicode
    byte_limit: a non-negative integer
    encoding: a text encoding

    returns a unicode prefix of src guaranteed to fit within byte_limit when
    encoded as encoding.
    '''
    return src.encode(encoding)[:byte_limit].decode(encoding, 'ignore')

So for example:

s = u"""
    هيك بنكون
    ascii
    عيش بجنون تون تون تون هيك بنكون
    عيش بجنون تون تون تون
    أوكي أ
"""

b = truncate_unicode_to_byte_limit(s, 73)
print len(b.encode('utf-8')), b

produces output:

73 
    هيك بنكون
    ascii
    عيش بجنون تون تون تو

For a unicode string s, you would need to use something like len(s.encode('utf-8')) to get its length in bytes. len(s) just returns the number of (unencoded) characters.
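A quick illustration of the difference, written for Python 3 where every `str` is Unicode (the sample string is arbitrary):

```python
s = "هيك ascii"

char_count = len(s)                  # number of code points
byte_count = len(s.encode("utf-8"))  # number of bytes once encoded as UTF-8

print(char_count, byte_count)  # 9 12
```

The three Arabic letters take 2 bytes each, while the space and the ASCII letters take 1 byte each.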

Update: After further research I discovered that Python supports incremental encoding, which makes it possible to write a reasonably fast function that trims off excess characters without corrupting any multi-byte sequences within the string. Here's example code using it for this task:

# -*- coding: utf-8 -*-

import encodings
_incr_encoder = encodings.search_function('utf8').incrementalencoder()

def utf8_byte_truncate(text, max_bytes):
    """ truncate utf-8 text string to no more than max_bytes long """
    byte_len = 0
    _incr_encoder.reset()
    for index,ch in enumerate(text):
        byte_len += len(_incr_encoder.encode(ch))
        if byte_len > max_bytes:
            break
    else:
        return text
    return text[:index]

s = u"""
    هيك بنكون
    ascii
    عيش بجنون تون تون تون هيك بنكون
    عيش بجنون تون تون تون
    أوكي أ
"""

print 'initial string:'
print s.encode('utf-8')
print "{} chars, {} bytes".format(len(s), len(s.encode('utf-8')))
print
s2 = utf8_byte_truncate(s, 74)  # trim string
print 'after truncation to no more than 74 bytes:'
# following will raise encoding error exception on any improper truncations
print s2.encode('utf-8')
print "{} chars, {} bytes".format(len(s2), len(s2.encode('utf-8')))

Output:

initial string:

    هيك بنكون
    ascii
    عيش بجنون تون تون تون هيك بنكون
    عيش بجنون تون تون تون
    أوكي أ

98 chars, 153 bytes

after truncation to no more than 74 bytes:

    هيك بنكون
    ascii
    عيش بجنون تون تون تو
49 chars, 73 bytes
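For readers on Python 3, the same idea might look like the sketch below. It is not part of the original answer: the function name `truncate_with_incremental_encoder` is invented here, and it assumes `codecs.getincrementalencoder` as the way to obtain the encoder, creating a fresh one per call instead of sharing a module-level instance.

```python
import codecs

def truncate_with_incremental_encoder(text, max_bytes):
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    encoder = codecs.getincrementalencoder("utf-8")()
    byte_len = 0
    for index, ch in enumerate(text):
        byte_len += len(encoder.encode(ch))
        if byte_len > max_bytes:
            # ch did not fit, so keep everything before it
            return text[:index]
    return text  # the whole string fits

result = truncate_with_incremental_encoder("هيك بنكون", 5)
print(result, len(result.encode("utf-8")))
```

With a limit of 5 bytes only the first two letters (4 bytes) are kept, because the third letter would need bytes 5 and 6.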
Mark Tolonen

Using the algorithm I posted on your other question, this will encode a Unicode string as UTF-8 and drop only whole UTF-8 sequences, so that the encoded length ends up less than or equal to a maximum length:

s = u"""
    هيك بنكون
    ascii
    عيش بجنون تون تون تون هيك بنكون
    عيش بجنون تون تون تون
    أوكي أ
"""

def utf8_lead_byte(b):
    '''A UTF-8 intermediate byte starts with the bits 10xxxxxx; any other byte is a lead byte.'''
    return (ord(b) & 0xC0) != 0x80

def utf8_byte_truncate(text,max_bytes):
    '''If the encoded byte at index max_bytes is not a lead byte, back up
    until a lead byte is found and truncate before that character.'''
    utf8 = text.encode('utf8')
    if len(utf8) <= max_bytes:
        return utf8
    i = max_bytes
    while i > 0 and not utf8_lead_byte(utf8[i]):
        i -= 1
    return utf8[:i]

b = utf8_byte_truncate(s,74)
print len(b),b.decode('utf8')

Output

73 
    هيك بنكون
    ascii
    عيش بجنون تون تون تو
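The code above targets Python 2, where indexing a byte string returns a one-character string and `ord()` is needed. A possible Python 3 adaptation is sketched below; the name `utf8_truncate_py3` is hypothetical, and the sketch assumes `max_bytes` is a non-negative integer.

```python
def utf8_truncate_py3(text, max_bytes):
    """Encode text as UTF-8 and cut it to at most max_bytes, whole sequences only."""
    utf8 = text.encode("utf-8")
    if len(utf8) <= max_bytes:
        return utf8
    i = max_bytes
    # Indexing bytes yields ints in Python 3, so no ord() is needed.
    # Back up while utf8[i] is a continuation byte (10xxxxxx).
    while i > 0 and (utf8[i] & 0xC0) == 0x80:
        i -= 1
    return utf8[:i]

cut = utf8_truncate_py3("هيك بنكون", 5)
print(len(cut), cut.decode("utf-8"))
```

Because the loop moves back by at most three bytes, the cost of the cut itself does not grow with the length of the string; only the initial encode does.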