How can I determine the byte length of a UTF-8 encoded string in Python?

自闭症患者 2020-12-29 23:33

I am working with Amazon S3 uploads and am having trouble with key names being too long. S3 limits the length of the key by bytes, not characters.

From the docs:

3 Answers
  • 2020-12-29 23:42

    Use the string's encode method to convert the character string to a byte string, then call len() on the result as usual:

    >>> s = u"¡Hola, mundo!"
    >>> len(s)  # characters
    13
    >>> len(s.encode('utf-8'))  # bytes
    14
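
Applied to the S3 problem in the question, the same call can check a key before uploading. This is only a minimal sketch: `key_fits` and `MAX_KEY_BYTES` are hypothetical names, and the 1024-byte limit is an assumption taken from the figure mentioned in this thread, so confirm it against the S3 documentation.

```python
# Assumed limit: 1024 bytes is the figure mentioned in this thread,
# not something verified here. Check the S3 docs before relying on it.
MAX_KEY_BYTES = 1024


def key_fits(key, limit=MAX_KEY_BYTES):
    """Return True if the UTF-8 encoding of key is at most limit bytes."""
    return len(key.encode("utf-8")) <= limit


ascii_key = "a" * 1024          # 1024 characters, 1 byte each
accented_key = "\u00e9" * 600   # 600 characters, 2 bytes each in UTF-8

print(key_fits(ascii_key))      # True: exactly 1024 bytes
print(key_fits(accented_key))   # False: 1200 bytes
```

The second key is well under 1024 characters but still fails the check, which is the case that counting characters would miss.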
    
  • 2020-12-29 23:48

    Encoding the string and calling len on the result works well, as the other answers show. It does have to build a throw-away copy of the string, which may not be optimal for very large strings (I don't consider 1024 bytes large, though). The structure of UTF-8 lets you compute the encoded length of each character directly from its code point, without encoding it, although encoding a single character may still be easier. Both methods are shown here; they should give the same result.

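    def utf8_char_len_1(c):
        # Derive the UTF-8 length from the code point range, without encoding.
        codepoint = ord(c)
        if codepoint <= 0x7f:
            return 1  # ASCII
        if codepoint <= 0x7ff:
            return 2
        if codepoint <= 0xffff:
            return 3
        if codepoint <= 0x10ffff:
            return 4
        raise ValueError('Invalid Unicode character: ' + hex(codepoint))
    
    def utf8_char_len_2(c):
        # Encode the single character and measure the result.
        return len(c.encode('utf-8'))
    
    # Pick whichever implementation you prefer.
    utf8_char_len = utf8_char_len_1
    
    def utf8len(s):
        # Sum the per-character lengths; no encoded copy of s is built.
        return sum(utf8_char_len(c) for c in s)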
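
A quick way to gain confidence that the two approaches agree is to compare them on characters from each length class. The sketch below assumes Python 3, where iterating a str yields whole code points; `utf8len_by_codepoint` is a hypothetical compact restatement of the range checks above, not a separate technique.

```python
def utf8len_by_codepoint(s):
    """Sum per-character UTF-8 lengths derived from code point ranges."""
    total = 0
    for c in s:
        cp = ord(c)
        if cp <= 0x7F:
            total += 1
        elif cp <= 0x7FF:
            total += 2
        elif cp <= 0xFFFF:
            total += 3
        else:
            total += 4
    return total


# One sample per length class: 1, 2, 3 and 4 bytes, plus the empty string.
samples = ["abc", "\u00a1Hola, mundo!", "\u20ac", "\U0001F600", ""]

# Pair each computed length with the length of the actual encoding.
results = [(utf8len_by_codepoint(s), len(s.encode("utf-8"))) for s in samples]

for computed, encoded in results:
    assert computed == encoded
```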
    
  • 2020-12-30 00:01
    def utf8len(s):
        return len(s.encode('utf-8'))
    

    Works in both Python 2 and 3 (in Python 2, pass a unicode string rather than a byte str containing non-ASCII data).
