UTF16 hex to text

给你一囗甜甜゛ 提交于 2019-12-07 12:05:28

问题


I have UTF-16 hex representation such as “0633064406270645” which is "سلام" in Arabic language.

I would like to convert it to its text equivalent. Is there a straight way to do that in PostgreSQL?

I can convert the UTF code point like below; unfortunately it seems UTF16 is not supported. Any ideas on how to do it in PostgreSQL, worst case I will write a function?

SELECT convert_from (decode (E'D8B3D984D8A7D985', 'hex'),'UTF8');

"سلام"

SELECT convert_from (decode (E'0633064406270645', 'hex'),'UTF16');

ERROR:  invalid source encoding name "UTF16"
********** Error **********

回答1:


That's right, Postgres doesn't support UTF-16.

However, it does support Unicode escape sequences:

SELECT U&'\0633\0644\0627\0645'

But keep in mind that Unicode code points and UTF-16 code units are only equivalent in the Basic Multilingual Plane. In other words, if you have any UTF-16 characters which span multiple 16-bit code units, you'll need to translate them to the corresponding code point yourself.




回答2:


convert_from (or PostgreSQL in general) doesn't support UTF-16, but you may resort to one of the optional languages that do.

Example in plperlu (requires database superuser privileges to create the function, and CREATE LANGUAGE plperlu if not created already):

CREATE FUNCTION decode_utf16(text) RETURNS text AS $$
  require Encode;
  return Encode::decode("UTF-16BE", pack("H*", $_[0]));
$$ immutable language plperlu;

=> select decode_utf16('0633064406270645');

 decode_utf16 
--------------
 سلام



回答3:


PostgreSQL does not support UTF-16 natively. I suggest you to convert your data to UTF-8 before supplying it to the DB. If it's too late (wrong data already exists in your DB), you can use these maintenance functions to convert data from UTF-16 (logic copied from wikipedia):

-- convert from bytea, containing UTF-16-BE data
CREATE OR REPLACE FUNCTION convert_from_utf16be(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
WITH source(unit) AS (
  SELECT (get_byte(utf16_data, i) << 8) | get_byte(utf16_data, i + 1)
  FROM generate_series(0, octet_length(utf16_data) - 2, 2) i
),
codes(lag, unit, lead) AS (
  SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER ()
  FROM source
)
SELECT string_agg(CASE
  WHEN unit >= 56320 AND unit <= 57343 THEN CASE
    WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed
    ELSE invalid_replacement
  END
  WHEN unit >= 55296 AND unit <= 56319 THEN CASE
    WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888)
    ELSE invalid_replacement
  END
  ELSE chr(unit)
END, '')
FROM codes
$function$;

-- convert from bytea, containing UTF-16-LE data
CREATE OR REPLACE FUNCTION convert_from_utf16le(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
WITH source(unit) AS (
  SELECT get_byte(utf16_data, i) | (get_byte(utf16_data, i + 1) << 8)
  FROM generate_series(0, octet_length(utf16_data) - 2, 2) i
),
codes(lag, unit, lead) AS (
  SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER ()
  FROM source
)
SELECT string_agg(CASE
  WHEN unit >= 56320 AND unit <= 57343 THEN CASE
    WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed
    ELSE invalid_replacement
  END
  WHEN unit >= 55296 AND unit <= 56319 THEN CASE
    WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888)
    ELSE invalid_replacement
  END
  ELSE chr(unit)
END, '')
FROM codes
$function$;

-- convert from bytea, containing UTF-16 data (with or without BOM)
CREATE OR REPLACE FUNCTION convert_from_utf16(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT CASE COALESCE(octet_length(utf16_data), 0)
  WHEN 0 THEN ''
  WHEN 1 THEN invalid_replacement
  ELSE CASE substring(utf16_data FOR 2)
    WHEN E'\\xFFFE' THEN convert_from_utf16le(substring(utf16_data FROM 3), invalid_replacement)
    ELSE convert_from_utf16be(substring(utf16_data FROM 3), invalid_replacement)
  END
END
$function$;

With these functions, you can convert from all kind of UTF-16:

SELECT convert_from_utf16be(decode('0633064406270645D852DF62', 'hex')),
       convert_from_utf16le(decode('330644062706450652D862DF', 'hex')),
       convert_from_utf16(decode('FEFF0633064406270645D852DF62', 'hex')),
       convert_from_utf16(decode('FFFE330644062706450652D862DF', 'hex'));

-- convert_from_utf16be | convert_from_utf16le | convert_from_utf16 | convert_from_utf16
------------------------+----------------------+--------------------+-------------------
-- سلام𤭢                | سلام𤭢                | سلام𤭢              | سلام𤭢


来源:https://stackoverflow.com/questions/26607867/utf16-hex-to-text

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!