BigQuery: Convert accented characters to their plain ASCII equivalents

花落未央 2020-12-17 04:21

I have the following string:

brasília

And I need to convert to:

brasilia

Without the ´ accent!

Ho

4 answers
  •  借酒劲吻你
    2020-12-17 04:36

    It's worth mentioning that what you're asking for is a simplified case of Unicode text normalization. Many languages have a function for this in their standard libraries (e.g., Java). One good approach would be to insert your text into BigQuery already normalized. If that won't work -- for example, because you need to retain the original text and you're concerned about hitting BigQuery's row size limit -- then you'll need to do normalization on the fly in your queries.

    Some databases ship Unicode normalization of varying completeness for use in queries (e.g., PostgreSQL's unaccent function, PrestoDB's normalize function). When this answer was first written, BigQuery was not one of them, so most of the implementations in this answer are a kind of "roll your own unaccent." Now that BigQuery has an official function, everyone should use that instead!

    Assuming you need to do the normalization in your query, these are some reasonable options.

    Approach 1: Use NORMALIZE

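    Google has since come out with a NORMALIZE function. (Thanks to @WillianFuks in the comments for flagging!) This is now the obvious choice for text normalization. For example:

    SELECT REGEXP_REPLACE(NORMALIZE(text, NFD), r"\pM", '') FROM yourtable;

    NORMALIZE defaults to NFC, which keeps accented characters composed, so the NFD mode has to be requested explicitly: it decomposes each accented character into its base letter followed by one or more combining marks. REGEXP_REPLACE then strips those marks, since \pM matches any character in the Unicode Mark category. Characters with no canonical decomposition, such as æ, œ, and ø, pass through unchanged. There is a longer explanation in the comments.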
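
    For readers who want to see the decompose-then-strip idea outside BigQuery, here is a minimal Python sketch using the standard unicodedata module. The function name strip_accents is made up for illustration, and the sketch assumes NFD decomposition followed by removal of every character whose Unicode category starts with "M", which is the same set that \pM matches. It could also serve as a starting point for normalizing text before it is loaded.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD).

    Hypothetical helper for illustration; not part of any BigQuery API.
    """
    # NFD splits e.g. "í" into "i" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # Categories Mn, Mc and Me together form the Mark class matched by \pM.
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


if __name__ == "__main__":
    print(strip_accents("brasília"))  # brasilia
```

    Letters such as æ or ø have no canonical decomposition, so this sketch leaves them as they are.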

    I have left the additional approaches for reference.

    Approach 2: Use REGEXP_REPLACE and REPLACE on Content

    I implemented the lowercase-only case of text normalization in legacy SQL using REGEXP_REPLACE. (The analog in Standard SQL is fairly self-evident.) I ran some tests on a text field with average length around 1K in a large table of 28M rows using the query below:

    SELECT id, text FROM
      (SELECT 
        id,
        CASE
        WHEN REGEXP_CONTAINS(LOWER(text), r"[àáâäåæçèéêëìíîïòóôöøùúûüÿœ]") THEN
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
                REGEXP_REPLACE(
                  REGEXP_REPLACE(
                    REPLACE(REPLACE(REPLACE(REPLACE(LOWER(text), 'œ', 'oe'), 'ÿ', 'y'), 'ç', 'c'), 'æ', 'ae'),
                  r"[ùúûü]", 'u'),
                r"[òóôöø]", 'o'),
              r"[ìíîï]", 'i'),
            r"[èéêë]", 'e'),
          r"[àáâäå]", 'a')
        ELSE
          LOWER(text)
        END AS text
      FROM
       yourtable ORDER BY id LIMIT 10);
    

    versus:

    WITH lookups AS (
      SELECT 
      'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,ñ' AS accents,
      'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,n' AS latins
    ),
    pairs AS (
      SELECT accent, latin FROM lookups, 
        UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1, 
        UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
      WHERE p1 = p2
    )
    SELECT foo FROM (
      SELECT 
        id,
        (SELECT STRING_AGG(IFNULL(latin, char), '') AS foo FROM UNNEST(SPLIT(LOWER(text), '')) char LEFT JOIN pairs ON char=accent) AS foo
      FROM
      yourtable ORDER BY id LIMIT 10);
    

    On average, the REGEXP_REPLACE implementation ran in about 2.9s; the array-based implementation ran in about 12.5s.
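
    If the text can be cleaned before it reaches BigQuery, the same lookup pairs can be applied client-side. The sketch below is one way to do that in Python, reusing the accent and Latin lists from the lookups CTE above; the names ACCENTS, LATINS, TABLE and unaccent_lookup are made up for illustration. Like the queries above, it lowercases the text first and only handles the characters listed.

```python
import unicodedata

# Same pairs as the `lookups` CTE in the array-based query.
ACCENTS = "ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,ñ".split(",")
LATINS = "c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,n".split(",")

# Keys are forced to NFC so that each one is a single code point,
# which str.maketrans requires for string keys.
TABLE = str.maketrans(
    {unicodedata.normalize("NFC", a): l for a, l in zip(ACCENTS, LATINS)}
)


def unaccent_lookup(text: str) -> str:
    """Lowercase the text and replace the listed accented characters."""
    # Compose the input as well, so that "i" + combining acute becomes "í"
    # and can be found in the table.
    return unicodedata.normalize("NFC", text).lower().translate(TABLE)


if __name__ == "__main__":
    print(unaccent_lookup("Brasília"))  # brasilia
```

    Unlike a pure decomposition approach, a table like this can expand ligatures such as æ and œ into two letters, at the cost of having to list every character by hand.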

    Approach 3: Use REGEXP_REPLACE on Search Pattern

    What brought me to this question was a search use case. For this use case, I can either normalize my corpus text so that it looks more like my query, or I can "denormalize" my query so that it looks more like my text. The above describes an implementation of the first approach. This describes an implementation of the second.

    When searching for a single word, one can use the REGEXP_MATCH function (REGEXP_CONTAINS in Standard SQL) and merely rewrite the search term using the following patterns:

    a -> [aàáâäãåā]
    e -> [eèéêëēėę]
    i -> [iîïíīįì]
    o -> [oôöòóøōõ]
    u -> [uûüùúū]
    y -> [yÿ]
    s -> [sßśš]
    l -> [lł]
    z -> [zžźż]
    c -> [cçćč]
    n -> [nñń]
    æ -> (?:æ|ae)
    œ -> (?:œ|oe)
    

    So the query "hello" would look like this, as a regexp:

    r"h[eèéêëēėę][lł][lł][oôöòóøōõ]"
    

    Transforming the word into this regular expression should be fairly straightforward in any language. This isn't a solution to the posted question -- "How do I remove accents in BigQuery?" -- but is rather a solution to a related use case, which might have brought people (like me!) to this page.
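
    As a rough illustration of that transformation, here is a minimal Python sketch. The names VARIANTS and accent_insensitive_pattern are made up for this example, the mapping copies the pattern table above (with œ expanding to oe), and characters outside the table are regex-escaped and matched literally. It assumes the search term is lowercased first; the resulting string is what would be passed to the BigQuery regexp function.

```python
import re

# Pattern table from the answer; each plain letter expands to a character class.
VARIANTS = {
    "a": "[aàáâäãåā]",
    "e": "[eèéêëēėę]",
    "i": "[iîïíīįì]",
    "o": "[oôöòóøōõ]",
    "u": "[uûüùúū]",
    "y": "[yÿ]",
    "s": "[sßśš]",
    "l": "[lł]",
    "z": "[zžźż]",
    "c": "[cçćč]",
    "n": "[nñń]",
    "æ": "(?:æ|ae)",
    "œ": "(?:œ|oe)",  # assumption: the ligature expands to its letters "oe"
}


def accent_insensitive_pattern(word: str) -> str:
    """Build a regexp that matches `word` with or without accents."""
    # Letters without an entry (and punctuation) are escaped and kept literal.
    return "".join(VARIANTS.get(ch, re.escape(ch)) for ch in word.lower())


if __name__ == "__main__":
    pattern = accent_insensitive_pattern("brasilia")
    # Checked here with Python's re module; BigQuery uses RE2, which accepts
    # the same character-class and non-capturing-group syntax.
    print(bool(re.fullmatch(pattern, "brasília")))  # True
```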
