Combining Devanagari characters

礼貌的吻别 2020-12-05 02:40

I have something like

a = "बिक्रम मेरो नाम हो"

I want to achieve something like

a[0] = बि
a[1] = क्र
a[2] = म
         


        
6 answers
  •  暗喜 (original poster)
     2020-12-05 03:32

    So, you want to achieve something like this

    a[0] = बि
    a[1] = क्र
    a[2] = म
    

    My advice is to ditch the idea that string indexing corresponds to the characters you see on the screen. Devanagari, like several other scripts, does not play well with programmers who grew up with Latin characters. I suggest reading chapter 9 of the Unicode Standard.
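    To see why indexing does not match what is on the screen, it can help to list the code points behind the first word of the question. The sketch below uses only the standard `unicodedata` module; the variable name `word` is just for illustration, and the word is spelled with escapes so the code points are unambiguous.

```python
import unicodedata

# बिक्रम, spelled out code point by code point
word = "\u092c\u093f\u0915\u094d\u0930\u092e"

# Indexing walks code points, so word[0] is the bare consonant ब, not "बि".
for index, ch in enumerate(word):
    print(index, f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Three visible units on screen, but six indexable positions.
print(len(word))
```

    The vowel sign and the virama are combining marks (general category `Mc` and `Mn`), which is why they attach to the preceding consonant when rendered but still occupy their own index.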

    It looks like what you are trying to do is break a string into grapheme clusters. String indexing by itself will not let you do this. Hangul is another script that plays poorly with string indexing, and once combining characters are involved, even something as familiar as Spanish causes problems.
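    The Spanish case is easy to reproduce with the standard library alone. This is a small illustration, not part of the original answer: the same word has a different length depending on whether ñ is stored as one code point or as n plus a combining tilde.

```python
import unicodedata

composed = "ni\u00f1o"                               # ñ as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # n followed by U+0303

# Both strings render identically, yet their lengths differ.
print(len(composed), len(decomposed))

# Index 3 of the decomposed form is the bare combining tilde.
print(unicodedata.name(decomposed[3]))
```

    Slicing `decomposed[:3]` would silently drop the tilde and turn the word into a different one, which is the same failure mode as cutting a Devanagari syllable in half.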

    You will need an external library such as ICU to achieve this (unless you have lots of free time). ICU has Python bindings.

    >>> a = u"बिक्रम मेरो नाम हो"
    >>> import icu
        # Note: This next line took a lot of guesswork.  The C, C++, and Java
        # interfaces have better documentation.
    >>> b = icu.BreakIterator.createCharacterInstance(icu.Locale())
    >>> b.setText(a)
    >>> i = 0
    >>> for j in b:
    ...     s = a[i:j]
    ...     print('|', s, len(s))
    ...     i = j
    ... 
    | बि 2
    | क् 2
    | र 1
    | म 1
    |   1
    | मे 2
    | रो 2
    |   1
    | ना 2
    | म 1
    |   1
    | हो 2
    

    Note how some of these "characters" (grapheme clusters) have length 2, and some have length 1. This is why string indexing is problematic: if I want to get grapheme cluster #69450 from a text file, then I have to linearly scan through the entire file and count. So your options are:

    • Build an index (kind of crazy...)
    • Just realize that you can't break on every character boundary. The break iterator object is capable of going both forwards AND backwards, so if you need to extract the first 140 characters of a string, then you look at index 140 and iterate backwards to the previous grapheme cluster break, that way you don't end up with funny text. (Better yet, you can use a word break iterator for the appropriate locale.) The benefit of using this level of abstraction (character iterators and the like) is that it no longer matters which encoding you use: you can use UTF-8, UTF-16, UTF-32 and it all just works. Well, mostly works.
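    Both options can be sketched without ICU, under a loud assumption: the helper below treats every code point whose general category starts with `M` (a combining mark) as part of the previous cluster. That is a simplification of the real segmentation rules, not a substitute for a break iterator, and newer rule sets may group conjuncts such as क्र differently. The names `cluster_boundaries` and `truncate` are hypothetical.

```python
import bisect
import unicodedata

def cluster_boundaries(text):
    """Approximate cluster boundaries: a cluster starts at every code point
    that is not a combining mark (general category M*). Simplified on
    purpose; real segmentation has many more rules."""
    bounds = [i for i, ch in enumerate(text)
              if i == 0 or not unicodedata.category(ch).startswith("M")]
    bounds.append(len(text))
    return bounds

def truncate(text, limit, bounds):
    """Cut text to at most `limit` code points, backing up to the nearest
    cluster boundary at or before `limit`."""
    limit = max(limit, 0)
    if limit >= len(text):
        return text
    cut = bounds[bisect.bisect_right(bounds, limit) - 1]
    return text[:cut]

a = "बिक्रम मेरो नाम हो"

# Option 1: build the index once, then cluster n is a constant-time slice.
bounds = cluster_boundaries(a)
clusters = [a[start:end] for start, end in zip(bounds, bounds[1:])]
print(clusters)

# Option 2: pick a code point limit and back up to a safe boundary.
print(truncate(a, 3, bounds))
```

    With this index, `a[bounds[n]:bounds[n + 1]]` returns cluster `n` directly, at the cost of one linear pass up front. The truncation helper never returns a dangling vowel sign or virama, because it only cuts at positions stored in `bounds`.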
