Find possible duplicates in two columns ignoring case and special characters

北海茫月 2021-02-09 05:22

Query

SELECT COUNT(*), name, number
FROM   tbl
GROUP  BY name, number
HAVING COUNT(*) > 1

It sometimes fails to find duplicates between lowe

3 answers
  •  星月不相逢
    2021-02-09 05:57

    (Updated answer after clarification from poster): The idea of "unaccenting" or stripping accents (diacritics) is generally bogus. It's OK-ish if you're matching data to find out whether some misguided user or application munged résumé into resume, but it's totally wrong to change one into the other, as they're different words. Even then it'll only kind-of work, and should be combined with a string-similarity matching system like trigrams or Levenshtein distances.

    The idea of "unaccenting" presumes that any accented character has a single valid equivalent unaccented character, or at least that any given accented character is replaced with at most one unaccented character in an ascii-ized representation of the word. That simply isn't true; in one language ö might be a "u" sound, while in another it might be a long "oo", and the "ascii-ized" spelling conventions might reflect that. Thus, in one language the correct "un-accenting" of the made-up dummy-word "Tapö" might be "Tapu", and in another this imaginary word might be ascii-ized to "Tapoo". In neither case will the "un-accented" form "Tapo" match what people actually write when forced into the ascii character set. Words with diacritics may also be ascii-ized into a hyphenated word.

    You can see this in English with ligatures, where the word dæmon is ascii-ized as daemon. If you stripped the ligature you'd get dmon, which wouldn't match daemon, the common spelling. The same is true of æther, which is typically ascii-ized to aether or ether. You can also see this in German with ß, typically "expanded" as ss.
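
    For the string-similarity matching suggested above, a minimal sketch might look like the following. It assumes the pg_trgm and fuzzystrmatch contrib extensions are available on your server; the sample words are only illustrative, and the scores you get can depend on the database encoding and locale, so no expected output is shown.

    ```sql
    -- Assumption: both contrib extensions can be installed in this database.
    CREATE EXTENSION IF NOT EXISTS pg_trgm;        -- provides similarity()
    CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;  -- provides levenshtein()

    -- Compare every pair of sample words once (a.x < b.x skips self-pairs
    -- and mirrored pairs).
    WITH words(x) AS (
        VALUES ('résumé'), ('resume'), ('daemon'), ('dæmon')
    )
    SELECT a.x                                  AS left_value,
           b.x                                  AS right_value,
           similarity(lower(a.x), lower(b.x))   AS trigram_similarity,
           levenshtein(lower(a.x), lower(b.x))  AS edit_distance
    FROM   words a
    JOIN   words b ON a.x < b.x
    ORDER  BY trigram_similarity DESC;
    ```

    Pairs with a high trigram similarity or a small edit distance are candidates for manual review rather than proven duplicates; where to put the cut-off is a judgment call for your data.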

    If you must attempt to "un-accent", "normalize" accents or "strip" accents:

    You can use a character class regular expression to strip out all but a specified set of characters. In this case we use the \W escape (shorthand for the character class [^[:alnum:]_] as per the manual) to exclude "symbols" but not accented characters:

    regress=# SELECT regexp_replace(lower(x),'\W','','g') 
              FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
     regexp_replace 
    ----------------
     soft
     café
    (2 rows)
    

    If you want to filter out accented chars too you can define your own character class:

    regress=# SELECT regexp_replace(lower(x),'[^a-z0-9]','','g')
              FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
     regexp_replace 
    ----------------
     soft
     caf
    (2 rows)
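
    Plugged back into the query from the question, either expression can serve as the grouping key. The following is a sketch only: it assumes tbl has the columns name and number as in the question, casts number to text in case it is not already a text column, and introduces the aliases name_key, number_key, copies and original_names purely for illustration.

    ```sql
    -- Group on normalized keys instead of the raw column values.
    SELECT count(*)         AS copies,
           name_key,
           number_key,
           array_agg(name)  AS original_names   -- the raw spellings that collided
    FROM (
        SELECT name,
               regexp_replace(lower(name),         '\W', '', 'g') AS name_key,
               regexp_replace(lower(number::text), '\W', '', 'g') AS number_key
        FROM   tbl
    ) normalized
    GROUP  BY name_key, number_key
    HAVING count(*) > 1;
    ```

    Swap '\W' for '[^a-z0-9]' if accented characters should be dropped from the key as well. Listing the original spellings next to each key makes it easier to check by eye whether a group is a real duplicate.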
    

    If you actually intended to replace some accented characters with similar unaccented characters, you could use translate as per this wiki article. Note that translate only maps the listed characters, so symbols are left untouched:

    regress=# SELECT translate(
            lower(x),
            'âãäåāăąÁÂÃÄÅĀĂĄèééêëēĕėęěĒĔĖĘĚìíîïìĩīĭÌÍÎÏÌĨĪĬóôõöōŏőÒÓÔÕÖŌŎŐùúûüũūŭůÙÚÛÜŨŪŬŮ',
            'aaaaaaaaaaaaaaaeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiooooooooooooooouuuuuuuuuuuuuuuu'
        )
        FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
    
     translate 
    -----------
     $s^o&f!t
     cafe
    (2 rows)
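
    If you want both effects at once (accents folded and symbols removed), the two steps can be nested. This is a sketch with a deliberately short, hypothetical mapping list rather than the full one from the wiki; extend it to whatever characters your data actually contains, keeping both strings the same length.

    ```sql
    -- Step 1: lower-case. Step 2: map a few accented letters to ASCII.
    -- Step 3: drop everything that is not a-z or 0-9.
    SELECT regexp_replace(
               translate(
                   lower(x),
                   'áàâãäåéèêëíìîïóòôõöúùûü',   -- 23 source characters
                   'aaaaaaeeeeiiiiooooouuuu'    -- 23 replacements, same order
               ),
               '[^a-z0-9]', '', 'g'
           ) AS normalized
    FROM ( VALUES ('$s^o&f!t'), ('Café') ) vals(x);
    ```

    In a UTF-8 database this should give soft and cafe for the two sample values. All the caveats above about "un-accenting" still apply: the result is a matching key, not a correct spelling.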
    
