Python returning the wrong length of string when using special characters

前端未结

关注

 5  1348

庸人自扰 2020-12-03 05:48

I have a string ë́aúlt that I want to get the length of a manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I gu

5条回答

半阙折子戏 (楼主)

2020-12-03 06:09
You said: I have a string ë́aúlt that I want to get the length of a manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

The first step in working on any Unicode problem is to know exactly what is in your data; don't guess. In this case your guess is correct; it won't always be.

"Exactly what is in your data": use the repr() built-in function (for lots more things apart from unicode). A useful advantage of showing the repr() output in your question is that answerers then have exactly what you have. Note that your text displays in only FOUR positions instead of 5 with some browsers/fonts -- the 'e' and its diacritics and the 'a' are mangled together in one position.

You can use the unicodedata.name() function to tell you what each component is.

Here's an example:
```
# coding: utf8
import unicodedata
x = u"ë́aúlt"
print(repr(x))
for c in x:
    try:
        name = unicodedata.name(c)
    except:
        name = ""
    print "U+%04X" % ord(c), repr(c), name
```
Results:
```
u'\xeb\u0301a\xfalt'
U+00EB u'\xeb' LATIN SMALL LETTER E WITH DIAERESIS
U+0301 u'\u0301' COMBINING ACUTE ACCENT
U+0061 u'a' LATIN SMALL LETTER A
U+00FA u'\xfa' LATIN SMALL LETTER U WITH ACUTE
U+006C u'l' LATIN SMALL LETTER L
U+0074 u't' LATIN SMALL LETTER T
```
Now read @bobince's answer :-)
0 讨论(0)

查看其它5个回答
发布评论:

提交评论
- 加载中...