Python Inconsistent Special Character Storage In String

Question

Version is Python 3.7. I've just found out that Python sometimes stores the character ñ in a string using different representations, and I'm completely at a loss as to why, or how to deal with it.

I'm not sure of the best way to show this issue, so I'm just going to show some code output.

I have two strings, s1 and s2, which both appear to equal 'Dan Peña'.

They are both of type str.

I can run the code:

print(s1 == s2) # prints False
print(len(s1)) # prints 8
print(len(s2)) # prints 9
print(type(s1)) # prints <class 'str'>
print(type(s2)) # prints <class 'str'>
for i in range(len(s1)):
    print(s1[i] + ", " + s2[i])

The output of the loop would be:

D, D
a, a
n, n
 ,  
P, P
e, e
ñ, n
a, ~

So, are there any Python methods for dealing with these inconsistencies, or at least some specification of when Python will use which representation?

It would also be nice to know why Python is implemented this way.

Edit:

One string is retrieved from a Django database, and the other comes from parsing a filename returned by a listdir call.

from os import listdir

from app.models import Model
from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def handle(self, *args, **kwargs):
        load_dir = "load_dir_name"
        save_dir = "save_dir"

        files = listdir(load_dir)
        save_file_map = {file[:file.index("_thumbnail.jpg")]: f"{save_dir}/{file}" for file in files}
        for obj in Model.objects.all():
            s1 = obj.title
            save_file_path = save_file_map[s1] # KeyError when the title contains ñ.

However, when I search through the save_file_map dict, I find a key that is exactly the same as s1, except that the ñ is stored as the two characters n and ~ rather than as the single character ñ.

Note that the files I load in the above code with listdir were named based on the obj.title field in the first place, so a file with that name should be guaranteed to exist in the load_dir directory.

Answer 1:

You'll want to normalize the strings so that they use the same representation. Right now, one of them uses an n character followed by a combining tilde (two code points), while the other uses a single code point representing an n with a tilde.
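To make the difference visible, here is a minimal sketch that builds both forms explicitly with escape sequences and lists the non-ASCII code points in each. The variable names composed and decomposed and the helper describe are made up for illustration; they are not from the question.

```python
import unicodedata

# Two spellings of the same visible text.
composed = "Dan Pe\u00f1a"     # ñ as one code point (U+00F1)
decomposed = "Dan Pen\u0303a"  # plain n followed by a combining tilde (U+0303)

def describe(text):
    """Return the Unicode names of the non-ASCII code points in text."""
    return [unicodedata.name(ch) for ch in text if ord(ch) > 127]

print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 8 9
print(describe(composed))               # ['LATIN SMALL LETTER N WITH TILDE']
print(describe(decomposed))             # ['COMBINING TILDE']
```

Both strings usually render identically in a terminal, which is why the mismatch only shows up in len() or when indexing character by character.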

unicodedata.normalize should do what you want. See the documentation for the unicodedata module.

You'll want to call it like so: unicodedata.normalize('NFC', s1). 'NFC' tells unicodedata.normalize that you want the composed form of everything, e.g. the one-character version of ñ. The docs list other forms besides 'NFC' (such as 'NFD', the decomposed form); which one you use is up to you, as long as you apply the same form to both strings.

At what point you normalize is also up to you (I don't know how your app is structured). For example, you could normalize before inserting into the database, or normalize every time you read from the database.
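As a small sketch of that call, assuming s1 holds the composed form and s2 the decomposed form (written here with escape sequences so the difference survives copy and paste):

```python
import unicodedata

s1 = "Dan Pe\u00f1a"    # composed: ñ is a single code point
s2 = "Dan Pen\u0303a"   # decomposed: n plus a combining tilde

# Normalize both strings to the composed form before comparing.
nfc1 = unicodedata.normalize("NFC", s1)
nfc2 = unicodedata.normalize("NFC", s2)

print(s1 == s2)      # False
print(nfc1 == nfc2)  # True
print(len(nfc2))     # 8

# NFD goes the other way and splits ñ into n plus a combining tilde.
nfd1 = unicodedata.normalize("NFD", s1)
print(len(nfd1))     # 9
print(nfd1 == s2)    # True
```

Either form works for comparison; what matters is that both sides go through the same one.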
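For the lookup in the question, one option is to normalize at both ends: once when building the dict keys from the filenames, and once when looking up the title. The sketch below is an assumption about how that could look, not the asker's actual code. The helper nfc is hypothetical, and a hard-coded list stands in for the listdir result so the snippet runs on its own. Some file systems (HFS+ on macOS, for instance) hand back filenames in decomposed form, which would explain where the two-character ñ comes from, though the question does not say which platform is involved.

```python
import unicodedata

def nfc(text):
    """Hypothetical helper: return text in composed (NFC) form."""
    return unicodedata.normalize("NFC", text)

suffix = "_thumbnail.jpg"
save_dir = "save_dir"

# Stand-in for listdir(load_dir): a filename in decomposed form.
files = ["Dan Pen\u0303a_thumbnail.jpg"]

# Normalize only the key; the path keeps the filename exactly as listed.
save_file_map = {
    nfc(name[: -len(suffix)]): f"{save_dir}/{name}"
    for name in files
    if name.endswith(suffix)
}

# Stand-in for obj.title coming from the database in composed form.
title = "Dan Pe\u00f1a"
save_file_path = save_file_map[nfc(title)]
print(save_file_path == f"{save_dir}/{files[0]}")  # True
```

Keeping the original filename in the dict value is deliberate: the path has to match what is on disk, while only the key needs a canonical form.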

Source: https://stackoverflow.com/questions/56759851/python-inconsistent-special-character-storage-in-string

Tags

python

python-3.x

string

python-3.7

unicode-normalization