How can I find any non ASCII characters in Redshift database

问题

I've a database table I'd like to return all values where a column contains a non ASCII character anywhere in the string.

Is there an easy way to do this?

I've tried this

select col_name,  regexp_instr(col_name,'[^[:ascii:]]') 
from test_table s
where created > sysdate - 1 
and regexp_instr(col_name,'[^[:ascii:]]') > 0 
limit 5;

but get this error:

error:  Invalid character class name, collating name, or character range.  The error occured while parsing the regular expression: '[^[:>>>HERE>>>ascii:]]'.
  code:      8002
  context:   T_regexp_init
  query:     5059536
  location:  funcs_expr.cpp:130
  process:   query20_31 [pid=7903]

Thanks!

回答1:

You can select the range or characters you wish to use by looking at this table: http://unicode-table.com/en/ .

Look at the code below, where I filter based on a specific character range.

dev=# create table tb1 ( id integer, col1 varchar(10) , col2 varchar(10) ) distkey(id);
CREATE TABLE

dev=# insert into tb1 values (1, 'ABC', 'DEF');
INSERT 0 1

dev=# insert into tb1 select 1, chr(638), chr(640);
INSERT 0 1
dev=# insert into tb1 select 4, concat(concat('AB', chr(638)),'CD') , chr(640);                                                                                  
INSERT 0 1

dev=# select * from tb1;
 id | col1  | col2 
----+-------+------
  4 | ABɾCD | ʀ
  1 | ABC   | DEF
  1 | ɾ     | ʀ
(3 rows)

dev=# select id, col1,  regexp_instr(col1,'[^\x00-\x7F]') , col2,  regexp_instr(col2,'[^\x00-\x7F]')  from tb1;
 id | col1  | regexp_instr | col2 | regexp_instr 
----+-------+--------------+------+--------------
  4 | ABɾCD |            3 | ʀ    |            1
  1 | ABC   |            0 | DEF  |            0
  1 | ɾ     |            1 | ʀ    |            1
(3 rows)

回答2:

I was trying to accomplish something similar recently but the above solution (writing '[^\x00-\x7F]' in the regex expression) won't work.
Usually, a backslash in combination with a literal character can create a regex token with a special meaning, in this case \x represents "the character whose hexadecimal value is" where 00 and 7F are the hex values.
While Postgres supports it (see 9.7.3.3. Regular Expression Escapes), it seems that Redshift's regex engine does not. You can check what exactly Redshift supports here.

The shortest and cleanest workaround I've found for this is:

SELECT column_name,
       CASE
           WHEN regexp_instr(column_name, '[^[:print:][:cntrl:]]') > 0 THEN TRUE
           ELSE FALSE END AS has_non_ascii_char
FROM table_name
WHERE has_non_ascii_char;

来源：https://stackoverflow.com/questions/39375454/how-can-i-find-any-non-ascii-characters-in-redshift-database

标签

ascii

amazon-redshift