Find possible duplicates in two columns ignoring case and special characters

安稳与你 提交于 2019-12-03 07:19:00
Erwin Brandstetter

lower()/ upper()

Use one of these to fold characters to either lower or upper case. Special characters are not affected:

SELECT count(*), lower(name), number
FROM   tbl
GROUP  BY lower(name), number
HAVING count(*) > 1;

unaccent()

If you actually want to ignore diacritic signs, like your comments imply, install the additional module unaccent, which provides a text search dictionary that removes accents and also the general purpose function unaccent():

CREATE EXTENSION unaccent;

Makes it very simple:

SELECT lower(unaccent('Büßercafé')) AS norm

Result:

busercafe

This doesn't strip non-letters. Add regexp_replace() like @Craig mentioned for that:

SELECT lower(unaccent(regexp_replace('$s^o&f!t Büßercafé', '\W', '', 'g') ))
                                                                     AS norm

Result:

softbusercafe

You can even build a functional index on top of that:

PostgreSQL by default is case sensitive. You can force it to be case-insensitive during searches by converting all values to a single case:

SELECT COUNT(*), lower(name), number FROM TABLE 
GROUP BY lower(name), number HAVING COUNT(*) > 1
  • NOTE: This has not been tested in Postgres

(Updated answer after clarification from poster): The idea of "unaccenting" or stripping accents (dicratics) is generally bogus. It's OK-ish if you're matching data to find out if some misguided user or application munged résumé into resume, but it's totally wrong to change one into the other, as they're different words. Even then it'll only kind-of work, and should be combined with a string-similarity matching system like trigrams or Levenshtein distances.

The idea of "unaccenting" presumes that any accented character has a single valid equivalent unaccented character, or at least that any given accented character is replaced with at most one unaccented character in an ascii-ized representation of the word. That simply isn't true; in one language ö might be a "u" sound, while in another it might be a long "oo", and the "ascii-ized" spelling conventions might reflect that. Thus, in language the correct "un-accenting" of the made-up dummy-word "Tapö" might be "Tapu" and in another this imaginary word might be ascii-ized to "Tapoo". In neither case will the "un-accented" form of "Tapo" match what people actually write when forced into the ascii character set. Words with dicratics may also be ascii-ized into a hyphenated word.

You can see this in English with ligatures, where the word dæmon is ascii-ized daemon. If you stripped the ligature you'd get dmon which wouldn't match daemon, the common spelling. The same is true of æther which is typically ascii-ized to aether or ether. You can also see this in German with ß, typically "expanded" as ss.

If you must attempt to "un-accent", "normalize" accents or "strip" accents:

You can use a character class regular expression to strip out all but a specified set of characters. In this case we use the \W escape (shorthand for the character class [^[:alnum:]_] as per the manual) to exclude "symbols" but not accented characters:

regress=# SELECT regexp_replace(lower(x),'\W','','g') 
          FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
 regexp_replace 
----------------
 soft
 café
(2 rows)

If you want to filter out accented chars too you can define your own character class:

regress=# SELECT regexp_replace(lower(x),'[^a-z0-9]','','g')
          FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
 regexp_replace 
----------------
 soft
 caf
(2 rows)

If you actually intended to substitute some accented characters for similar unaccented characters, you could use translate as per this wiki article:

regress=# SELECT translate(
        lower(x),
        'âãäåāăąÁÂÃÄÅĀĂĄèééêëēĕėęěĒĔĖĘĚìíîïìĩīĭÌÍÎÏÌĨĪĬóôõöōŏőÒÓÔÕÖŌŎŐùúûüũūŭůÙÚÛÜŨŪŬŮ',
        'aaaaaaaaaaaaaaaeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiooooooooooooooouuuuuuuuuuuuuuuu'
    )
    FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);

 translate 
-----------
 $s^o&f!t
 cafe
(2 rows)
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!