Python UTF-8 Lowercase Turkish Specific Letter

Question

Using Python 2.7:

>myCity = 'Isparta'
>myCity.lower()
>'isparta'
#-should be-
>'ısparta'

I tried some decoding (e.g. myCity.decode("utf-8").lower()) but could not figure out how to do it.

How can I lowercase these kinds of letters? ('I' > 'ı', 'İ' > 'i', etc.)

EDIT: In Turkish, the lowercase of 'I' is 'ı' and the uppercase of 'i' is 'İ'.

Answer 1:

Some have suggested using the tr_TR.utf8 locale. At least on Ubuntu, perhaps related to this bug, setting this locale does not produce the desired result:

import locale
locale.setlocale(locale.LC_ALL, 'tr_TR.utf8')

myCity = u'Isparta İsparta'
print(myCity.lower())
# isparta isparta

So if this bug affects you, as a workaround you could perform this translation yourself:

lower_map = {
    ord(u'I'): u'ı',
    ord(u'İ'): u'i',
    }

myCity = u'Isparta İsparta'
lowerCity = myCity.translate(lower_map)
print(lowerCity)
# ısparta isparta

prints

ısparta isparta
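
The same translate-based workaround can be extended to cover uppercasing as well. The following is a minimal sketch, not part of the original answer: the names TR_LOWER_MAP, TR_UPPER_MAP, turkish_lower and turkish_upper are hypothetical, and it assumes the input is a unicode string (a plain str on Python 3). Only the two Turkish-specific pairs are special-cased; every other character goes through the default lower()/upper().

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers: only the dotted/dotless i pairs are special-cased.
TR_LOWER_MAP = {ord(u'I'): u'ı', ord(u'İ'): u'i'}
TR_UPPER_MAP = {ord(u'ı'): u'I', ord(u'i'): u'İ'}


def turkish_lower(text):
    """Lowercase unicode text, applying the Turkish rules for I and İ first."""
    return text.translate(TR_LOWER_MAP).lower()


def turkish_upper(text):
    """Uppercase unicode text, applying the Turkish rules for ı and i first."""
    return text.translate(TR_UPPER_MAP).upper()


print(turkish_lower(u'Isparta İsparta'))
print(turkish_upper(u'ısparta isparta'))
```

The translation has to run before lower()/upper(), since the default mapping would otherwise turn 'I' into 'i' and lose the distinction.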

Answer 2:

You can use a class derived from unicode, based on Emre's solution:

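# -*- coding: utf-8 -*-

class unicode_tr(unicode):
    CHARMAP = {
        "to_upper": {
            u"ı": u"I",
            u"i": u"İ",
        },
        "to_lower": {
            u"I": u"ı",
            u"İ": u"i",
        }
    }

    def lower(self):
        # Apply the Turkish-specific mappings first, then the default rules.
        text = unicode(self)
        for key, value in self.CHARMAP["to_lower"].items():
            text = text.replace(key, value)
        # Call the base method explicitly so this can never recurse.
        return unicode.lower(text)

    def upper(self):
        text = unicode(self)
        for key, value in self.CHARMAP["to_upper"].items():
            text = text.replace(key, value)
        return unicode.upper(text)

if __name__ == '__main__':
    # The u prefix matters: a non-ASCII byte string would fail to decode here.
    print unicode_tr(u"kitap").upper()
    print unicode_tr(u"KİTAP").lower()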

Gives

KİTAP
kitap

This should solve your problem.

Answer 3:

You need to set the proper locale (I'm guessing tr-TR) with locale.setlocale(). Otherwise the default upper/lower mappings will be used, and if that default is en-US, the lowercase version of I is i.

Answer 4:

You can simply call the .replace() method before converting to upper or lower case. In your case:

    myCity.replace(u'I', u'ı').lower()
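
The word "before" matters here. As a small illustration (the variable names are made up for this sketch, and unicode literals are assumed), compare replacing first with lowercasing first:

```python
# -*- coding: utf-8 -*-
city = u'ISPARTA İSPARTA'

# Replace first: the Turkish capitals are still present to be matched.
replace_first = city.replace(u'I', u'ı').replace(u'İ', u'i').lower()

# Lower first: 'I' has already become 'i', so the replace finds nothing.
lower_first = city.lower().replace(u'I', u'ı')

print(replace_first)
print(lower_first)
```

In the second variant the default lower() has already discarded the dotless form, so no later replace can recover it.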

Answer 5:

I forked and redesigned Emre's solution by monkey-patching the methods of the built-in unicode type. The advantage of this approach is that there is no need to subclass unicode or to wrap strings as my_unicode_string = unicode_tr(u'bla bla bla'). Simply importing this module makes it work seamlessly with native unicode strings.

https://github.com/technic-programming/unicode_tr

# -*- coding: utf8 -*-
# Redesigned by @guneysus

import __builtin__
from forbiddenfruit import curse

lcase_table = tuple(u'abcçdefgğhıijklmnoöprsştuüvyz')
ucase_table = tuple(u'ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ')

def upper(data):
    data = data.replace(u'i',u'İ')
    data = data.replace(u'ı',u'I')
    result = ''
    for char in data:
        try:
            char_index = lcase_table.index(char)
            ucase_char = ucase_table[char_index]
        except ValueError:  # char is not in the Turkish table
            ucase_char = char
        result += ucase_char
    return result

def lower(data):
    data = data.replace(u'İ',u'i')
    data = data.replace(u'I',u'ı')
    result = ''
    for char in data:
        try:
            char_index = ucase_table.index(char)
            lcase_char = lcase_table[char_index]
        except ValueError:  # char is not in the Turkish table
            lcase_char = char
        result += lcase_char
    return result

def capitalize(data):
    return data[:1].upper() + data[1:].lower()  # slicing is safe for empty strings

def title(data):
    return " ".join(map(lambda x: x.capitalize(), data.split()))

curse(__builtin__.unicode, 'upper', upper)
curse(__builtin__.unicode, 'lower', lower)
curse(__builtin__.unicode, 'capitalize', capitalize)
curse(__builtin__.unicode, 'title', title)

if __name__ == '__main__':
    print u'istanbul'.upper()
    print u'İSTANBUL'.lower()
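
For comparison, here is a rough sketch of the same lookup tables used without monkey-patching or forbiddenfruit. It is not part of the module above: the function names (upper_tr, lower_tr, capitalize_tr, title_tr) are hypothetical, and it assumes unicode input. The paired tables are turned into dictionaries once, so each character is a dictionary lookup rather than a tuple.index() scan, and letters outside the Turkish table (q, w, x) fall through to the default case rules.

```python
# -*- coding: utf-8 -*-
LCASE = u'abcçdefgğhıijklmnoöprsştuüvyz'
UCASE = u'ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ'

# Code point -> replacement character, built once from the paired tables.
TO_UPPER = {ord(lo): up for lo, up in zip(LCASE, UCASE)}
TO_LOWER = {ord(up): lo for lo, up in zip(LCASE, UCASE)}


def upper_tr(text):
    # Characters missing from the table are handled by the default upper().
    return text.translate(TO_UPPER).upper()


def lower_tr(text):
    # Characters missing from the table are handled by the default lower().
    return text.translate(TO_LOWER).lower()


def capitalize_tr(text):
    # Slicing keeps this safe for empty strings.
    return upper_tr(text[:1]) + lower_tr(text[1:])


def title_tr(text):
    # Like the title() above, this collapses runs of whitespace.
    return u' '.join(capitalize_tr(word) for word in text.split())


print(upper_tr(u'istanbul'))
print(lower_tr(u'İSTANBUL'))
print(title_tr(u'istanbul ısparta'))
```

Plain functions avoid changing the behaviour of every unicode string in the process, at the cost of having to call them explicitly.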

Source: https://stackoverflow.com/questions/19030948/python-utf-8-lowercase-turkish-specific-letter

Tags

python

unicode

encoding

utf-8