Decode Unicode's to Local language in Bigquery

余生颓废 提交于 2019-12-13 03:00:42

问题


We receive a survey Web-hook data in Bigquery. The comment in local language is captured as unicode and we do have special character in that comment.

  • Example

    • Comment in survey- "别老是晚点,现场补行李费太贵"
    • Comment in Bigquery data- "\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35"

We found a solution for decode individual comment :-

    CREATE TEMPORARY FUNCTION utf8convert(s STRING)
    RETURNS STRING
    LANGUAGE js AS """
    return unescape( ( s ) );
    """;
    with sample AS (SELECT '\u522b\u8001\u662f\u665a' AS S)
    SELECT utf8convert(s) from sample

When implement this code in comment field with thousand of comment and different languages its not working.

    CREATE TEMPORARY FUNCTION utf8convert(s STRING)
    RETURNS STRING
    LANGUAGE js AS """
    return unescape( ( s ) );
    """;
   SELECT Comment, utf8convert(Comment) as Convert
   FROM `airasia-nps.nps_production.NPSDashboard_Webhook_Data1`
   where Comment is not null 

no error while running but the result is in Unicode does not change to local language. Result: local language in Unicode

  • I have try this code

      CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
      IF(s NOT LIKE '%\\u%', s,
      (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
      FROM UNNEST(SPLIT(s, '\\u')) AS x
       WHERE x != ''))
      );
    
      SELECT
      original,
      DecodeUnicode(original) AS decoded
      FROM (
      SELECT trim(r'$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01') AS original UNION ALL
      SELECT trim(r'abcd')
      );
    

shows error i think its because the comment start with special character?.


回答1:


See if this works. It does the "manual" decoding for strings that have \u in them by converting to Unicode code points and then to a string. It should be faster than using JavaScript, too.

CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
  IF(s NOT LIKE '%\\u%', s,
     (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
      FROM UNNEST(SPLIT(s, '\\u')) AS x
      WHERE x != ''))
);

SELECT
  original,
  DecodeUnicode(original) AS decoded
FROM (
  SELECT r'\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35' AS original UNION ALL
  SELECT r'abcd'
);

As output, this returns 别老是晚点,现场补行李费太贵 and abcd.



来源:https://stackoverflow.com/questions/47150140/decode-unicodes-to-local-language-in-bigquery

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!