How to make Django slugify work properly with Unicode strings?

猫巷女王i 2020-11-28 19:58

What can I do to prevent the slugify filter from stripping out non-ASCII alphanumeric characters? (I'm using Django 1.0.2.)

cnprog.com has Chinese characters in its URLs.

8 answers
  •  孤街浪徒
    2020-11-28 20:07

    Also, Django's version of slugify doesn't use the re.UNICODE flag, so it doesn't even attempt to interpret \w and \s as they apply to non-ASCII characters.
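
    To illustrate the effect of the flag, here is a minimal sketch in Python 3, where str patterns are Unicode-aware by default and re.ASCII stands in for the ASCII-only matching described above. The sample title and the pattern are assumptions chosen for illustration, not Django's actual code.

```python
import re

title = "Django 中文 slug test"

# ASCII-only matching: \w covers just [a-zA-Z0-9_], so the Chinese characters are stripped.
ascii_only = re.sub(r"[^\w\s-]", "", title, flags=re.ASCII)

# Unicode matching (the default for str patterns in Python 3; re.UNICODE in Python 2).
unicode_aware = re.sub(r"[^\w\s-]", "", title, flags=re.UNICODE)

print(repr(ascii_only))
print(repr(unicode_aware))
```

    With ASCII-only matching the two Chinese characters disappear and leave a double space behind; with Unicode matching the title passes through unchanged.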

    This custom version is working well for me:

    import re

    def u_slugify(txt):
            """A custom version of slugify that retains non-ascii characters. The purpose of this
            function in the application is to make URLs more readable in a browser, so there are
            some added heuristics to retain as much of the title meaning as possible while
            excluding characters that are troublesome to read in URLs. For example, question marks
            will be seen in the browser URL as %3F and are therefore unreadable. Although non-ascii
            characters will also be hex-encoded in the raw URL, most browsers will display them
            as human-readable glyphs in the address bar -- those should be kept in the slug."""
            # Flags are passed by keyword below: the fourth positional argument of re.sub
            # is count, not flags (the flags keyword needs Python 2.7 or later).
            txt = txt.strip() # remove leading and trailing whitespace
            txt = re.sub(r'\s*-\s*', '-', txt, flags=re.UNICODE) # remove spaces before and after dashes
            txt = re.sub(r'[\s/]', '_', txt, flags=re.UNICODE) # replace remaining spaces and slashes with underscores
            txt = re.sub(r'(\d):(\d)', r'\1-\2', txt, flags=re.UNICODE) # replace colons between numbers with dashes
            txt = re.sub('"', "'", txt, flags=re.UNICODE) # replace double quotes with single quotes
            txt = re.sub(r'[?,:!@#~`+=$%^&\\*()\[\]{}<>]', '', txt, flags=re.UNICODE) # remove some characters altogether
            return txt
    

    Note the last regex substitution. It is a workaround for a problem with the more robust expression r'\W', which seems either to strip out some non-ASCII characters or to re-encode them incorrectly, as illustrated in the following Python interpreter session:

    Python 2.5.1 (r251:54863, Jun 17 2009, 20:37:34) 
    [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> # Paste in a non-ascii string (traditional Chinese), taken from http://globallives.org/wiki/152/
    >>> str = '您認識對全球社區感興趣的中國攝影師嗎'
    >>> str
    '\xe6\x82\xa8\xe8\xaa\x8d\xe8\xad\x98\xe5\xb0\x8d\xe5\x85\xa8\xe7\x90\x83\xe7\xa4\xbe\xe5\x8d\x80\xe6\x84\x9f\xe8\x88\x88\xe8\xb6\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
    >>> print str
    您認識對全球社區感興趣的中國攝影師嗎
    >>> # Substitute all non-word characters with X
    >>> re_str = re.sub('\W', 'X', str, re.UNICODE)
    >>> re_str
    'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
    >>> print re_str
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX?的中國攝影師嗎
    >>> # Notice above that it retained the last 7 glyphs, ostensibly because they are word characters
    >>> # And where did that question mark come from?
    >>> 
    >>> 
    >>> # Now do the same with only the last three glyphs of the string
    >>> str = '影師嗎'
    >>> print str
    影師嗎
    >>> str
    '\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
    >>> re.sub('\W','X',str,re.U)
    'XXXXXXXXX'
    >>> re.sub('\W','X',str)
    'XXXXXXXXX'
    >>> # Huh, now it seems to think those same characters are NOT word characters
    

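    The behavior above does not come from "whatever is classified as alphanumeric in the Unicode character properties database." Two details of the session explain it. First, the strings are Python 2 byte strings rather than unicode objects, so \W is tested against individual UTF-8 bytes, none of which are ASCII word characters. Second, the fourth positional argument of re.sub is count, not flags, so re.UNICODE (integer value 32) caps the substitution at 32 replacements. In the long string the first 32 bytes become X and the substitution stops in the middle of a three-byte character, which is where the question mark comes from; the 9-byte string is short enough to be replaced entirely. Python 3 strings are Unicode by default, which removes the first problem.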
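
    The first output in the session can be reproduced in Python 3 by treating the text as UTF-8 bytes and capping the substitution at 32 replacements, which is the integer value of re.UNICODE. This is a sketch under those assumptions, with bytes standing in for a Python 2 str; the variable names are made up for illustration.

```python
import re

text = "您認識對全球社區感興趣的中國攝影師嗎"
data = text.encode("utf-8")  # the bytes a Python 2 str literal would hold

# re.UNICODE is a flag whose integer value is 32.
flag_value = int(re.UNICODE)

# Passing re.UNICODE as the fourth positional argument acts like count=32:
# only the first 32 matches (here, individual bytes) are replaced.
truncated = re.sub(rb"\W", b"X", data, count=flag_value)

# On a Unicode string every CJK character is a word character, so nothing is replaced.
unchanged = re.sub(r"\W", "X", text, flags=re.UNICODE)

print(flag_value)
print(truncated[:33])
print(unchanged == text)
```

    The 33rd byte of the result is the orphaned tail of a three-byte character, which a terminal cannot render as a glyph.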

    For now, a workaround is to avoid character classes and make substitutions based on explicitly defined character sets.
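
    As an alternative to explicit character sets, \w can be relied on once the input is a Unicode string and the flags are passed by keyword. Below is a minimal sketch in Python 3 using only the standard library. The name unicode_slugify and the NFKC normalization step are assumptions for illustration, not Django's API, though recent Django versions offer similar behavior through slugify(value, allow_unicode=True) in django.utils.text.

```python
import re
import unicodedata


def unicode_slugify(value):
    """Hypothetical helper: build a slug that keeps non-ASCII letters and digits."""
    # Normalize compatibility forms (e.g. full-width letters) to a canonical shape.
    value = unicodedata.normalize("NFKC", str(value))
    # Drop everything that is not a word character, whitespace, or a hyphen.
    value = re.sub(r"[^\w\s-]", "", value.lower(), flags=re.UNICODE)
    # Collapse runs of whitespace and hyphens into a single hyphen.
    value = re.sub(r"[-\s]+", "-", value, flags=re.UNICODE)
    # Trim separators left over at either end.
    return value.strip("-_")


print(unicode_slugify("Hello, 世界! How are you?"))  # hello-世界-how-are-you
print(unicode_slugify("  影師嗎 -- 2020  "))          # 影師嗎-2020
```

    Unlike u_slugify above, this sketch lowercases the text and joins words with hyphens instead of underscores; either choice can be changed without affecting the Unicode handling.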
