S3 -> Redshift cannot handle UTF8

Submitted on 2019-11-27 06:47:19

Question


We have a file in S3 that is loaded into Redshift via the COPY command. The import is failing because a VARCHAR(20) value contains an Ä, which is being translated into a two-byte sequence during the COPY and is now too long for the 20-character limit.

I have verified that the data is correct in S3, but COPY does not understand the UTF-8 characters during import. Has anyone found a solution for this?


Answer 1:


tl;dr

The byte length for your VARCHAR column just needs to be larger.

Detail

Multi-byte characters (UTF-8) are supported in the VARCHAR data type; however, the declared length is in bytes, NOT characters.

AWS documentation for Multibyte Character Load Errors states the following:

VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.

Therefore if you want the character Ä to be allowed, then you need to allow 2 bytes for this character, instead of 1 byte.

AWS documentation for VARCHAR or CHARACTER VARYING states the following:

... so a VARCHAR(120) column consists of a maximum of 120 single-byte characters, 60 two-byte characters, 40 three-byte characters, or 30 four-byte characters.
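As a quick sanity check outside Redshift, you can compare character and byte counts of a UTF-8 string in a shell (the sample value below is made up for illustration):

```shell
# 'Ä' (U+00C4) is one character but two bytes in UTF-8;
# Redshift's VARCHAR(n) limit counts bytes, not characters.
printf 'Ä' | wc -c                      # 2 bytes

# A made-up 20-character value containing one 'Ä' needs 21 bytes,
# so it no longer fits in a VARCHAR(20) column:
printf 'Ämsterdam-Warehouse1' | wc -c   # 21 bytes
```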

For a list of UTF-8 characters and their byte lengths, this is a good reference: Complete Character List for UTF-8

Detailed information for the Unicode Character 'LATIN CAPITAL LETTER A WITH DIAERESIS' (U+00C4) can be found here.




Answer 2:


Please check the link below:

http://docs.aws.amazon.com/redshift/latest/dg/multi-byte-character-load-errors.html

You should use ACCEPTINVCHARS in your COPY command. Details here:

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html#acceptinvchars
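As a sketch, a COPY using ACCEPTINVCHARS might look like the following when run through `psql`; the cluster endpoint, database, table, bucket, and IAM role are all hypothetical placeholders, not values from the question:

```shell
# All identifiers below are placeholders; substitute your own.
psql -h my-cluster.abc123.us-east-1.redshift.amazonaws.com \
     -p 5439 -U admin -d mydb -c "
  COPY my_table
  FROM 's3://my-bucket/data/file.csv'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
  CSV
  ACCEPTINVCHARS AS '?';"
```

ACCEPTINVCHARS replaces each invalid UTF-8 byte with the given replacement character (default `?`), so the load succeeds at the cost of altering those values.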




Answer 3:


Using "ACCEPTINVCHARS ESCAPE" in the COPY command solved the issue for us, with minor data alteration.




Answer 4:


I had a similar experience, where certain characters like Ä were not copied correctly when loading mysqldump data into our Redshift cluster. It was because the encoding of the mysqldump output was latin1, which is the default character set of MySQL. It's better to check the character encoding of the files before running COPY. If your files are not UTF-8, you have to re-encode them.
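For example, a latin1 dump file can be re-encoded to UTF-8 with `iconv` before uploading to S3 (file names here are illustrative):

```shell
# In latin1, 'Ä' is the single byte 0xC4 (octal 304); after
# conversion it becomes the two-byte UTF-8 sequence C3 84.
printf '\304' > dump_latin1.sql
iconv -f LATIN1 -t UTF-8 dump_latin1.sql > dump_utf8.sql
wc -c < dump_utf8.sql   # 2
```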




Answer 5:


You need to increase the size of your VARCHAR column. Check the stl_load_errors table to see the actual field value length for the failed rows, and increase the size accordingly. EDIT: Just realized this is a very old post; anyway, in case someone needs it...
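A query along these lines surfaces the failed rows; the connection details are placeholders, but stl_load_errors and the columns selected are real Redshift system-table fields:

```shell
# Placeholders for host/user/database; stl_load_errors records
# the offending rows, the target column, and the error reason.
psql -h my-cluster.abc123.us-east-1.redshift.amazonaws.com \
     -p 5439 -U admin -d mydb -c "
  SELECT starttime, colname, col_length,
         raw_field_value, err_reason
  FROM stl_load_errors
  ORDER BY starttime DESC
  LIMIT 10;"
```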



Source: https://stackoverflow.com/questions/27612294/s3-redshift-cannot-handle-utf8
