Python returning the wrong length of string when using special characters

庸人自扰 2020-12-03 05:48

I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

5 answers
  •  隐瞒了意图╮
    2020-12-03 06:16

    The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

    Yes. That's how code points are defined by Unicode. In general, you can ask Python to combine a letter and a separate ‘combining’ diacritical mark, such as U+0301 COMBINING ACUTE ACCENT, into a single precomposed character using Unicode normalisation:

    >>> import unicodedata
    >>> unicodedata.normalize('NFC', u'a\u0301')
    u'\xe1' # single character: á
    

    However, there is no single character in Unicode for “e with diaeresis and acute accent” because no language in the world has ever used the letter ‘ë́’. (Pinyin transliteration has “u with diaeresis and acute accent”, but not ‘e’.) Consequently, font support is poor; it renders really badly in many cases and is a messy blob in my web browser.
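To make the difference concrete, here is a minimal sketch in Python 3 syntax (where every `str` is Unicode, so no `u` prefix is needed). The variable names are purely illustrative. It normalises the same two combining marks on top of "e" and on top of "u" and compares the resulting lengths:

```python
import unicodedata

# "e" + COMBINING DIAERESIS (U+0308) + COMBINING ACUTE ACCENT (U+0301)
e_seq = "e\u0308\u0301"
# "u" followed by the same two combining marks
u_seq = "u\u0308\u0301"

e_nfc = unicodedata.normalize("NFC", e_seq)
u_nfc = unicodedata.normalize("NFC", u_seq)

# No precomposed "e with diaeresis and acute" exists, so NFC can only
# get as far as U+00EB (e with diaeresis) plus a separate U+0301.
print(len(e_nfc))  # 2

# The "u" variant does have a precomposed form, U+01D8, used in Pinyin.
print(len(u_nfc))  # 1
```

So normalisation helps with `len()` only when Unicode happens to define a precomposed character for the whole sequence.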

    Working out where the ‘editable points’ are in a string of Unicode code points is a tricky job that requires quite a bit of domain knowledge of languages. It's part of the issue of “complex text layout”, an area which also includes issues such as bidirectional text, contextual glyph shaping and ligatures. To do complex text layout you'll need a library such as Uniscribe on Windows, or Pango generally (for which there is a Python interface).
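If a full layout library is more than you need, a rough approximation may be enough for simple Latin-script strings. The sketch below (Python 3) uses a hypothetical helper, `split_clusters`, that attaches every character with a non-zero canonical combining class to the character before it. This is an assumption-laden shortcut, not the full Unicode text segmentation rules: it will mis-handle marks whose combining class is 0, emoji sequences, Hangul jamo and similar cases.

```python
import unicodedata

def split_clusters(s):
    """Group each base character with the combining marks that follow it.

    Approximation only: relies on unicodedata.combining(), so marks with
    combining class 0 are treated as base characters.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: glue it onto the previous cluster.
            clusters[-1] += ch
        else:
            # Base character (or a mark at the very start): new cluster.
            clusters.append(ch)
    return clusters

# The string from the question, written with explicit escapes:
# U+00EB (e with diaeresis), U+0301 (combining acute), a, U+00FA, l, t
word = "\u00eb\u0301a\u00falt"
clusters = split_clusters(word)

print(len(word))      # 6 code points
print(len(clusters))  # 5 user-visible units
```

Indexing into `clusters` instead of `word` then gives positions that match what a reader sees, with the accents kept intact.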

    If, on the other hand, you merely want to completely ignore all combining characters when doing a count, you can get rid of them easily enough:

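    import unicodedata

    def withoutcombining(s):
        # Keep only characters whose canonical combining class is 0,
        # i.e. drop combining marks such as U+0301.
        return ''.join(c for c in s if unicodedata.combining(c) == 0)
    
    >>> withoutcombining(u'ë́aúlt')
    u'\xeba\xfalt' # ëaúlt
    >>> len(_)
    5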
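One caveat with dropping combining characters: the result depends on whether the input uses precomposed letters or decomposed ones. The following Python 3 sketch uses a hypothetical `strip_combining` helper, equivalent in spirit to the function above, to show the difference. The count comes out the same either way, but the surviving letters do not.

```python
import unicodedata

def strip_combining(s):
    """Drop every character with a non-zero canonical combining class."""
    return "".join(c for c in s if unicodedata.combining(c) == 0)

# Precomposed input: U+00EB + U+0301, a, U+00FA, l, t
composed = "\u00eb\u0301a\u00falt"
# The same text fully decomposed (NFD): every accent is a separate mark
decomposed = unicodedata.normalize("NFD", composed)

kept_composed = strip_combining(composed)      # diaeresis and u-acute survive
kept_decomposed = strip_combining(decomposed)  # all accents are removed

print(len(kept_composed), len(kept_decomposed))  # 5 5
print(kept_decomposed)                           # eault
```

If the accents that do have precomposed forms should be preserved, normalising to NFC before stripping is one way to get consistent results.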
    
