S3 -> Redshift cannot handle UTF8


tl;dr

The byte length for your VARCHAR column just needs to be larger.

Detail

Multi-byte (UTF-8) characters are supported in the VARCHAR data type; however, the length you declare is in bytes, NOT characters.

AWS documentation for Multibyte Character Load Errors states the following:

VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.

Therefore, if you want the character Ä to be allowed, you need to allow 2 bytes for it instead of 1.

AWS documentation for VARCHAR or CHARACTER VARYING states the following:

... so a VARCHAR(120) column consists of a maximum of 120 single-byte characters, 60 two-byte characters, 40 three-byte characters, or 30 four-byte characters.
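As a rough illustration (the table and column names below are made up), you can check the character/byte difference directly, and widen a column whose declared byte length is too small. Redshift lets you increase the length of a VARCHAR column in place:

    -- LEN counts characters, OCTET_LENGTH counts bytes, so they differ
    -- for multi-byte UTF-8 text:
    SELECT LEN('Ä') AS chars, OCTET_LENGTH('Ä') AS bytes;  -- 1 char, 2 bytes

    -- Widen the column so the byte length can accommodate multi-byte data
    -- (my_table / name_col are hypothetical):
    ALTER TABLE my_table ALTER COLUMN name_col TYPE VARCHAR(512);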

For a list of UTF-8 characters and their byte lengths, this is a good reference: Complete Character List for UTF-8

Detailed information for the Unicode Character 'LATIN CAPITAL LETTER A WITH DIAERESIS' (U+00C4) can be found here.

Using "ACCEPTINVCHARS ESCAPE" in the copy command solved the issue for us with minor data alteration.

I had a similar experience in which only some characters, like Ä, were not copied correctly when loading mysqldump data into our Redshift cluster. It was because the mysqldump output was encoded in latin1, which is MySQL's default character set. It's better to check the character encoding of the files before running COPY; if your files are not UTF-8, you have to re-encode them.

You need to increase the size of your VARCHAR column. Check the stl_load_errors table to see the actual field value length for the failed rows, and increase the size accordingly. EDIT: Just realized this is a very old post; anyway, leaving it here in case someone needs it.
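For example, a query along these lines shows the offending column, its declared byte length, and the raw value for recent load failures:

    -- Inspect recent COPY failures; col_length is the declared size in bytes
    -- and raw_field_value shows the data that did not fit:
    SELECT starttime, filename, line_number, colname, col_length,
           raw_field_value, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 10;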
