Question
I ran into an unexpected behavior when using Redshift. Please see the illustrative example below:
drop table if exists joinTrim_temp1;
create table joinTrim_temp1(rowIndex1 int, charToJoin1 varchar(20));
insert into joinTrim_temp1 values(1, 'Sudan' );
insert into joinTrim_temp1 values(2, 'Africa' );
insert into joinTrim_temp1 values(3, 'USA' );
drop table if exists joinTrim_temp2;
create table joinTrim_temp2(rowIndex2 int, charToJoin2 varchar(20));
insert into joinTrim_temp2 values(1, 'Sudan ' );
insert into joinTrim_temp2 values(2, 'Africa ' );
insert into joinTrim_temp2 values(3, 'USA ' );
select * from joinTrim_temp1 a join joinTrim_temp2 b on a.charToJoin1 = b.charToJoin2;
The output of the query is as below:
As you can see, every value in the second table has a trailing space, so I expected the inner join to match no rows. Yet Redshift seems to ignore the trailing whitespace when joining.
I encountered this problem while converting existing Redshift SQL code to PySpark.
Regards, Kumar
Answer 1:
Ah! Indeed, a very interesting find!
From Character Types - Amazon Redshift:
Trailing spaces in VARCHAR and CHAR values are treated as semantically insignificant when values are compared.
It appears that, if you wish to force an exact comparison, you need to stop the space from being a trailing one, for example by appending a non-space character to both sides:
SELECT *
FROM joinTrim_temp1 a
JOIN joinTrim_temp2 b
ON a.charToJoin1 || '.' = b.charToJoin2 || '.';
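As a rough illustration of why appending '.' helps, here is a plain-Python sketch of the two comparison rules. It is not Redshift or PySpark code: the helper names (inner_join, ignore_trailing_spaces, with_sentinel) are made up for this example, and modelling the documented rule as "strip trailing spaces before comparing" is my assumption about how to emulate it.

```python
# Plain-Python sketch (not Redshift or PySpark code) of the two comparison rules.
# Same sample data as the tables in the question.
left = [(1, "Sudan"), (2, "Africa"), (3, "USA")]
right = [(1, "Sudan "), (2, "Africa "), (3, "USA ")]


def inner_join(left_rows, right_rows, key=lambda s: s):
    """Nested-loop inner join on the string column, comparing key(value)."""
    return [
        (li, ls, ri, rs)
        for li, ls in left_rows
        for ri, rs in right_rows
        if key(ls) == key(rs)
    ]


def ignore_trailing_spaces(s):
    # Assumed model of the documented rule: trailing spaces are insignificant.
    return s.rstrip(" ")


def with_sentinel(s):
    # Mirrors the `|| '.'` workaround: after appending '.', the space is
    # no longer trailing, so stripping trailing spaces leaves it in place.
    return ignore_trailing_spaces(s + ".")


exact_rows = inner_join(left, right)                               # exact equality
padded_rows = inner_join(left, right, key=ignore_trailing_spaces)  # trailing spaces ignored
sentinel_rows = inner_join(left, right, key=with_sentinel)         # workaround applied

print(len(exact_rows), len(padded_rows), len(sentinel_rows))  # prints "0 3 0"
```

Exact equality matches nothing, the trailing-space-insensitive rule matches all three rows (the behavior described in the question), and the sentinel brings the result back to zero matches. Going the other way, if the goal is to reproduce the Redshift result in an engine that compares strings exactly, stripping trailing spaces on both sides of the join key is the analogous move.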
Source: https://stackoverflow.com/questions/53569896/why-redshift-automatically-trims-varchar-column-when-joining