How to handle diacritics (accents) when rewriting 'pretty URLs'

长情又很酷 2020-11-30 09:48

I rewrite URLs to include the title of user generated travelblogs.

I do this for both readability of URLs and SEO purposes.

 http://www.example.com/gall         


        
6 answers
  • 2020-11-30 10:24

    Ultimately, you're going to have to give up on the idea of "correct", for this problem. Translating the string, no matter how you do it, destroys accuracy in the name of compatibility and readability. All three options are equally compatible, but #1 and #2 suffer in terms of readability. So just run with it and go for whatever looks best — option #3.

    Yes, the translations are wrong for German, but unless you start requiring your users to specify what language their titles are in (and restricting them to only one), you're not going to solve that problem without far more effort than it's worth. (For example, running each word in the title through dictionaries for each known language and translating that word's diacritics according to the rules of its language would work, but it's excessive.)

    Alternatively, if German is a higher concern than other languages, make your translation always use the German version when one exists: ä -> ae, ë -> e, ï -> i, ö -> oe, ü -> ue.

    Edit:

    Oh, and as for the actual method, I'd translate the special cases, if any, via str_replace, then use iconv for the rest:

    $text = str_replace(array("ä", "ö", "ü", "ß"), array("ae", "oe", "ue", "ss"), $text);
    $text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
    
  • 2020-11-30 10:30

    Nice topic, I had the same problem a while ago.
    Here's how I fixed it:

    function title2url($string=null){
     // return if empty
     if(empty($string)) return false;
    
     // replace spaces by "-"
     // convert accents to html entities (the input is assumed to be UTF-8)
     $string=htmlentities(str_replace(' ', '-', $string), ENT_QUOTES, 'UTF-8');
    
     // remove the accent from the letter, keeping both letters of ligatures (&aelig; becomes "ae")
     $string=preg_replace(array('@&([a-zA-Z]{1,2})(acute|grave|circ|tilde|uml|ring|elig|zlig|slash|cedil|strok|lig);@', '@&euro;@'), array('${1}', 'E'), $string);
    
     // now, everything but alphanumeric and -_ can be removed
     // also remove double dashes
     $string=preg_replace(array('@[^a-zA-Z0-9\-_]@', '@[\-]{2,}@'), array('', '-'), html_entity_decode($string, ENT_QUOTES, 'UTF-8'));
    
     return $string;
    }
    

    Here's how my function works:

    1. Convert it to html entities
    2. Strip the accents
    3. Remove all remaining weird chars
  • 2020-11-30 10:32

    Now people can write titles containing any UTF-8 character, but most are not allowed in the URL.

    On the contrary, most are allowed. See for example Wikipedia's URLs - things like http://en.wikipedia.org/wiki/Café (aka http://en.wikipedia.org/wiki/Caf%C3%A9) display nicely - even if StackOverflow's highlighter doesn't pick them out correctly :-)

    The trick is reading them reliably across any hosting environment; there are problems with CGI and Windows servers, particularly IIS, for example.
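
To illustrate the two forms of the Wikipedia URL above, here is a minimal sketch of percent-encoding a UTF-8 path segment with PHP's built-in rawurlencode() and rawurldecode(). It assumes the source file and the title string are UTF-8; the variable names are only for illustration.

```php
<?php
// The raw UTF-8 path segment as a user would type it (assumes this file is saved as UTF-8).
$title = 'Café';

// rawurlencode() percent-encodes every byte outside the RFC 3986 unreserved set.
$encoded = rawurlencode($title);
echo $encoded, "\n"; // Caf%C3%A9

// rawurldecode() restores the original UTF-8 bytes when the request comes back in.
$decoded = rawurldecode($encoded);
```

Browsers generally display the decoded form in the address bar, so the percent-encoded version is mostly what travels over the wire.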

  • 2020-11-30 10:36

    This is a good function:

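    function friendlyURL($string) {
        // iconv's //TRANSLIT output depends on the LC_CTYPE locale
        setlocale(LC_CTYPE, 'en_US.UTF8');
        // transliterate to ASCII, dropping anything that cannot be converted
        $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
        // collapse each run of whitespace into a single dash
        $string = preg_replace('/\s+/', '-', $string);
        $string = strtolower($string);
        return $string;
    }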
    
  • 2020-11-30 10:38

    To me the third is most readable.

    You could use a little dictionary, e.g. ï -> i and ü -> ue, to specify how you'd like various characters to be translated.
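
One possible shape for such a dictionary, as a rough sketch: a plain PHP array passed to strtr(). The function name transliterate_with_map and the contents of the map are made up for illustration; which replacements are right depends on the languages your users write in.

```php
<?php
// Hypothetical helper: apply a character map before building the slug.
function transliterate_with_map(string $text, array $map): string
{
    // With an array argument, strtr() tries the longest keys first
    // and never re-scans text it has already replaced.
    return strtr($text, $map);
}

// Illustrative map only (assumes UTF-8 source and input); extend as needed.
$map = [
    'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
    'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
    'ë' => 'e',  'ï' => 'i',  'é' => 'e',
];

echo transliterate_with_map('Müller naïve', $map), "\n"; // Mueller naive
```

Characters missing from the map pass through unchanged, so a generic fallback (iconv, or stripping non-ASCII) would still be needed afterwards.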

  • 2020-11-30 10:40

    As an interesting side note, on SO nothing seems to really matter after the ID -- this is a link to this page:

    How to handle diacritics (accents) when rewriting 'pretty URLs'

    Obviously the motivation is to allow title changes without breaking links, and you may want to consider that feature as well.
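
A rough sketch of that idea, assuming a hypothetical URL layout of /blog/<id>/<slug> in which only the numeric id identifies the post. The function names and the layout are assumptions for illustration, not how SO actually implements it.

```php
<?php
// Hypothetical layout: /blog/<numeric id>/<slug>; the slug is decorative.
function parse_post_path(string $path): ?array
{
    if (!preg_match('#^/blog/(\d+)(?:/([^/]*))?/?$#', $path, $m)) {
        return null; // not a post URL
    }
    // The slug group is absent from $m when the URL stops at the id.
    return ['id' => (int) $m[1], 'slug' => $m[2] ?? ''];
}

// Returns the canonical path to redirect to, or null when no redirect
// is needed (the slug is already current, or the path is not a post URL).
function canonical_redirect(string $path, string $currentSlug): ?string
{
    $parsed = parse_post_path($path);
    if ($parsed === null || $parsed['slug'] === $currentSlug) {
        return null;
    }
    return '/blog/' . $parsed['id'] . '/' . $currentSlug;
}

echo canonical_redirect('/blog/42/old-title', 'new-title'), "\n"; // /blog/42/new-title
```

In a real application the current slug would be looked up by id, and a non-null result would be sent as a 301 redirect so old links keep working after a title change.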
