BigQuery: Convert accented characters to their plain ascii equivalents

后端 未结 4 911
花落未央
花落未央 2020-12-17 04:21

I have the following string:

brasília

And I need to convert to:

brasilia

Withou the ´ accent!

Ho

相关标签:
4条回答
  • 2020-12-17 04:34

    I like this answer explanation. You can use:

    REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '')
    

    As a simple example:

    WITH data AS(
      SELECT 'brasília / paçoca' AS text
    )
    
    SELECT
      REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') RemovedDiacritics
    FROM data
    

    brasilia / pacoca

    UPDATE

    With the new string function Translate, it's much simpler to do it:

    WITH data AS(
      SELECT 'brasília / paçoca' AS text
    )
    
    SELECT
      translate(text, "ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöùúûüýÿ", "SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaceeeeiiiidnooooouuuuyy") as RemovedDiacritics
    FROM data
    

    brasilia / pacoca

    0 讨论(0)
  • 2020-12-17 04:36

    It's worth mentioning that what you're asking for is a simplified case of unicode text normalization. Many languages have a function for this in their standard libraries (e.g., Java). One good approach would be to insert your text BigQuery already normalized. If that won't work -- for example, because you need to retain the original text and you're concerned about hitting BigQuery's row size limit -- then you'll need to do normalization on the fly in your queries.

    Some databases have implementations of Unicode normalization of various completeness (e.g., PostgreSQL's unaccent method, PrestoDB's normalize method) for use in queries. Unfortunately, BigQuery is not one of them. There is no text normalization function in BigQuery as of this writing. The implementations on this answer are kind of a "roll your own unaccent." When BigQuery releases an official function, everyone should use that instead!

    Assuming you need to do the normalization in your query (and Google still hasn't come out with a function for this yet), these are some reasonable options.

    Approach 1: Use NORMALIZE

    Google now has come out with a NORMALIZE function. (Thanks to @WillianFuks in the comments for flagging!) This is now the obvious choice for text normalization. For example:

    SELECT REGEXP_REPLACE(NORMALIZE(text), r"\pM", '') FROM yourtable;
    

    There is a brief explanation of how this works and why the call to REGEXP_REPLACE is needed in the comments.

    I have left the additional approaches for reference.

    Approach 2: Use REGEXP_REPLACE and REPLACE on Content

    I implemented the lowercase-only case of text normalization in legacy SQL using REGEXP_REPLACE. (The analog in Standard SQL is fairly self-evident.) I ran some tests on a text field with average length around 1K in a large table of 28M rows using the query below:

    SELECT id, text FROM
      (SELECT 
        id,
        CASE
        WHEN REGEXP_CONTAINS(LOWER(text), r"[àáâäåæçèéêëìíîïòóôöøùúûüÿœ]") THEN
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
                REGEXP_REPLACE(
                  REGEXP_REPLACE(
                    REPLACE(REPLACE(REPLACE(REPLACE(LOWER(text), 'œ', 'ce'), 'ÿ', 'y'), 'ç', 'c'), 'æ', 'ae'),
                  r"[ùúûü]", 'u'),
                r"[òóôöø]", 'o'),
              r"[ìíîï]", 'i'),
            r"[èéêë]", 'e'),
          r"[àáâäå]", 'a')
        ELSE
          LOWER(text)
        END AS text
      FROM
       yourtable ORDER BY id LIMIT 10);
    

    versus:

    WITH lookups AS (
      SELECT 
      'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,ñ' AS accents,
      'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,n' AS latins
    ),
    pairs AS (
      SELECT accent, latin FROM lookups, 
        UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1, 
        UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
      WHERE p1 = p2
    )
    SELECT foo FROM (
      SELECT 
        id,
        (SELECT STRING_AGG(IFNULL(latin, char), '') AS foo FROM UNNEST(SPLIT(LOWER(text), '')) char LEFT JOIN pairs ON char=accent) AS foo
      FROM
      yourtable ORDER BY id LIMIT 10);
    

    On average, the REGEXP_REPLACE implementation ran in about 2.9s; the array-based implementation ran in about 12.5s.

    Approach 3: Use REGEXP_REPLACE on Search Pattern

    What brought me to this question my was a search use case. For this use case, I can either normalize my corpus text so that it looks more like my query, or I can "denormalize" my query so that it looks more like my text. The above describes an implementation of the first approach. This describes an implementation of the second.

    When searching for a single word, one can use the REGEXP_MATCH match function and merely update the query using the following patterns:

    a -> [aàáaâäãåā]
    e -> [eèéêëēėę]
    i -> [iîïíīįì]
    o -> [oôöòóøōõ]
    u -> [uûüùúū]
    y -> [yÿ]
    s -> [sßśš]
    l -> [lł]
    z -> [zžźż]
    c -> [cçćč]
    n -> [nñń]
    æ -> (?:æ|ae)
    œ -> (?:œ|ce)
    

    So the query "hello" would look like this, as a regexp:

    r"h[eèéêëēėę][lł][lł][oôöòóøōõ]"
    

    Transforming the word into this regular expression should be fairly straightforward in any language. This isn't a solution to the posted question -- "How do I remove accents in BigQuery?" -- but is rather a solution to a related use case, which might have brought people (like me!) to this page.

    0 讨论(0)
  • 2020-12-17 04:37

    Try below as quick and simple option for you:

    #standardSQL
    WITH lookups AS (
      SELECT 
      'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ' AS accents,
      'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n' AS latins
    ),
    pairs AS (
      SELECT accent, latin FROM lookups, 
        UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1, 
        UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
      WHERE p1 = p2
    ),
    yourTableWithWords AS (
      SELECT word FROM UNNEST(
            SPLIT('brasília,ångström,aperçu,barège, beau idéal, belle époque, béguin, bête noire, bêtise, Bichon Frisé, blasé, blessèd, bobèche, boîte, bombé, Bön, Boötes, boutonnière, bric-à-brac, Brontë Beyoncé,El Niño')
        ) AS word
    )
    SELECT 
      word AS word_with_accent, 
      (SELECT STRING_AGG(IFNULL(latin, char), '')
        FROM UNNEST(SPLIT(word, '')) char
        LEFT JOIN pairs
        ON char = accent) AS word_without_accent
    FROM yourTableWithWords   
    

    Output is

    word_with_accent word_without_accent     
    blessèd         blessed  
    El Niño         El Nino  
    belle époque    belle epoque     
    boîte           boite    
    Boötes          Bootes   
    blasé           blase    
    ångström        angstrom     
    bobèche         bobeche  
    barège          barege   
    bric-à-brac     bric-a-brac  
    bête noire      bete noire   
    Bichon Frisé    Bichon Frise     
    Brontë Beyoncé  Bronte Beyonce   
    bêtise          betise   
    beau idéal      beau ideal   
    bombé           bombe    
    brasília        brasilia     
    boutonnière     boutonniere  
    aperçu          apercu   
    béguin          beguin   
    Bön             Bon   
    

    UPDATE

    Below is how to pack this logic into SQL UDF - so accent2latin(word) can be called to make a "magic"

    #standardSQL
    CREATE TEMP FUNCTION accent2latin(word STRING) AS
    ((
      WITH lookups AS (
        SELECT 
        'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ' AS accents,
        'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n' AS latins
      ),
      pairs AS (
        SELECT accent, latin FROM lookups, 
          UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1, 
          UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
        WHERE p1 = p2
      )
      SELECT STRING_AGG(IFNULL(latin, char), '')
      FROM UNNEST(SPLIT(word, '')) char
      LEFT JOIN pairs
      ON char = accent
    ));
    
    WITH yourTableWithWords AS (
      SELECT word FROM UNNEST(
            SPLIT('brasília,ångström,aperçu,barège, beau idéal, belle époque, béguin, bête noire, bêtise, Bichon Frisé, blasé, blessèd, bobèche, boîte, bombé, Bön, Boötes, boutonnière, bric-à-brac, Brontë Beyoncé,El Niño')
        ) AS word
    )
    SELECT 
      word AS word_with_accent, 
      accent2latin(word) AS word_without_accent
    FROM yourTableWithWords 
    
    0 讨论(0)
  • 2020-12-17 04:52

    You can call REPLACE() or REGEXP_REPLACE(). You can find some regular expressions at Remove accents/diacritics in a string in JavaScript.

    Alternatively, you can use javascript UDF, but I expect it to be slower.

    0 讨论(0)
提交回复
热议问题