How can I find any non-ASCII characters in a Redshift database?

Submitted by 五迷三道 on 2019-12-11 04:26:42

Question


I have a database table, and I'd like to return all rows where a column contains a non-ASCII character anywhere in the string.

Is there an easy way to do this?

I've tried this:

select col_name,  regexp_instr(col_name,'[^[:ascii:]]') 
from test_table s
where created > sysdate - 1 
and regexp_instr(col_name,'[^[:ascii:]]') > 0 
limit 5;

but get this error:

error:  Invalid character class name, collating name, or character range.  The error occured while parsing the regular expression: '[^[:>>>HERE>>>ascii:]]'.
  code:      8002
  context:   T_regexp_init
  query:     5059536
  location:  funcs_expr.cpp:130
  process:   query20_31 [pid=7903]

Thanks!


Answer 1:


You can choose the range of characters you want to match by looking at this table: http://unicode-table.com/en/ .

Look at the code below, where I filter based on a specific character range.

dev=# create table tb1 ( id integer, col1 varchar(10) , col2 varchar(10) ) distkey(id);
CREATE TABLE

dev=# insert into tb1 values (1, 'ABC', 'DEF');
INSERT 0 1

dev=# insert into tb1 select 1, chr(638), chr(640);
INSERT 0 1
dev=# insert into tb1 select 4, concat(concat('AB', chr(638)),'CD') , chr(640);
INSERT 0 1

dev=# select * from tb1;
 id | col1  | col2 
----+-------+------
  4 | ABɾCD | ʀ
  1 | ABC   | DEF
  1 | ɾ     | ʀ
(3 rows)

dev=# select id, col1,  regexp_instr(col1,'[^\x00-\x7F]') , col2,  regexp_instr(col2,'[^\x00-\x7F]')  from tb1;
 id | col1  | regexp_instr | col2 | regexp_instr 
----+-------+--------------+------+--------------
  4 | ABɾCD |            3 | ʀ    |            1
  1 | ABC   |            0 | DEF  |            0
  1 | ɾ     |            1 | ʀ    |            1
(3 rows)
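
To return only the matching rows, as the question asks, the same expression can go in the WHERE clause. This is a sketch against the tb1 table above, and it assumes the regex engine accepts the \x00-\x7F range the way it did in this session:

```sql
-- Keep only rows where col1 holds at least one character outside 0x00-0x7F.
-- Assumption: the \x escapes are accepted, as in the session above.
select id, col1
from tb1
where regexp_instr(col1, '[^\x00-\x7F]') > 0;
```

Going by the regexp_instr values shown above, this should keep the ABɾCD and ɾ rows and drop ABC.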



Answer 2:


I was trying to accomplish something similar recently, but the above solution (writing '[^\x00-\x7F]' in the regular expression) won't work.
Usually, a backslash combined with a literal character creates a regex token with a special meaning. Here, \x means "the character whose hexadecimal value is", and 00 and 7F are the hex values.
While Postgres supports it (see 9.7.3.3. Regular Expression Escapes), it seems that Redshift's regex engine does not. Redshift's documentation lists exactly what it supports.

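The shortest and cleanest workaround I've found is to negate the POSIX classes [:print:] and [:cntrl:], which together cover the printable and control characters of ASCII, so anything left over is non-ASCII: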

SELECT column_name,
       CASE
           WHEN regexp_instr(column_name, '[^[:print:][:cntrl:]]') > 0 THEN TRUE
           ELSE FALSE END AS has_non_ascii_char
FROM table_name
WHERE has_non_ascii_char;
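
A regex-free alternative, which is not from the original answers: Redshift stores VARCHAR data as UTF-8, where every non-ASCII character takes more than one byte, so comparing byte length with character length should flag the same rows. This is a minimal sketch, assuming Redshift's LEN (characters) and OCTET_LENGTH (bytes) functions and the same placeholder names table_name and column_name used above:

```sql
-- Rows where the byte count differs from the character count
-- must contain at least one multibyte (non-ASCII) character.
-- Assumption: column_name is a VARCHAR column holding UTF-8 text.
SELECT column_name
FROM table_name
WHERE octet_length(column_name) <> len(column_name);
```

This only says whether a non-ASCII character is present; unlike regexp_instr, it does not report the position.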


Source: https://stackoverflow.com/questions/39375454/how-can-i-find-any-non-ascii-characters-in-redshift-database
