Python returning the wrong length of string when using special characters

庸人自扰 2020-12-03 05:48

I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

5 answers
  •  半阙折子戏
    2020-12-03 06:09

    You said: I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

    The first step in working on any Unicode problem is to know exactly what is in your data; don't guess. In this case your guess is correct; it won't always be.

    To see exactly what is in your data, use the repr() built-in function (it is useful for much more than Unicode). An advantage of showing the repr() output in your question is that answerers then see exactly what you have. Note that with some browsers/fonts your text displays in only FOUR positions instead of 5: the 'e' with its diacritics and the 'a' are mangled together in one position.

    You can use the unicodedata.name() function to tell you what each component is.
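
    A note for Python 3 readers: there, repr() leaves printable non-ASCII characters unescaped, so ascii() is the closer match to the Python 2 repr() output discussed here. A minimal sketch, assuming Python 3 and writing the question's string with escapes so the source is unambiguous:

```python
# Python 3 sketch: ascii() escapes every non-ASCII code point,
# much like repr() did for unicode strings in Python 2.
x = "\u00eb\u0301a\u00falt"  # the string from the question, written with escapes

print(ascii(x))  # '\xeb\u0301a\xfalt'
print(len(x))    # 6: U+0301 is counted as its own code point
```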

    Here's an example:

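    # coding: utf8
    # Python 2 example (print statement, u"" literal)
    import unicodedata
    x = u"ë́aúlt"
    print repr(x)
    for c in x:
        try:
            name = unicodedata.name(c)
        except ValueError:
            # unicodedata.name() raises ValueError for code points without a name
            name = ""
        print "U+%04X" % ord(c), repr(c), name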
    

    Results:

    u'\xeb\u0301a\xfalt'
    U+00EB u'\xeb' LATIN SMALL LETTER E WITH DIAERESIS
    U+0301 u'\u0301' COMBINING ACUTE ACCENT
    U+0061 u'a' LATIN SMALL LETTER A
    U+00FA u'\xfa' LATIN SMALL LETTER U WITH ACUTE
    U+006C u'l' LATIN SMALL LETTER L
    U+0074 u't' LATIN SMALL LETTER T
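
    The listing shows why len() reports 6 although the text reads as 5 characters: U+0301 is a separate code point. One way to count what the reader sees is to attach each combining mark to the character before it. The following is a rough Python 3 sketch, not a full grapheme-cluster implementation; split_clusters is a hypothetical helper name, and it relies on unicodedata.combining(), which returns 0 for some marks (those with combining class 0), so such marks are not merged.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper: only marks with a non-zero canonical combining
    class are attached to the preceding character.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

x = "\u00eb\u0301a\u00falt"  # the string from the question
clusters = split_clusters(x)
print(len(x))         # 6 code points
print(len(clusters))  # 5 base characters, each with its marks
```

    Indexing into clusters rather than into x then gives positions that match what is displayed, at least for simple cases like this one.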
    

    Now read @bobince's answer :-)
