Does Postgresql varchar count using unicode character length or ASCII character length?

前端 未结 2 407
野性不改
野性不改 2020-12-31 00:09

I tried importing a database dump from a SQL file and the insert failed when inserting the string Mér into a field defined as varying(3). I didn\'t

相关标签:
2条回答
  • 2020-12-31 00:47

    The length limit imposed by varchar(N) types and calculated by the length function is in characters, not bytes. So 'abcdef'::char(3) is truncated to 'abc' but 'a€cdef'::char(3) is truncated to 'a€c', even in the context of a database encoded as UTF-8, where 'a€c' is encoded using 5 bytes.

    If restoring a dump file complained that 'Mér' would not go into a varchar(3) column, that suggests you were restoring a UTF-8 encoded dump file into a SQL_ASCII database.

    For example, I did this in a UTF-8 database:

    create schema so4249745;
    create table so4249745.t(key varchar(3) primary key);
    insert into so4249745.t values('Mér');
    

    And then dumped this and tried to load it into a SQL_ASCII database:

    pg_dump -f dump.sql --schema=so4249745 --table=t
    createdb -E SQL_ASCII -T template0 enctest
    psql -f dump.sql enctest
    

    And sure enough:

    psql:dump.sql:34: ERROR:  value too long for type character varying(3)
    CONTEXT:  COPY t, line 1, column key: "Mér"
    

    By contrast, if I create the database enctest as encoding LATIN1 or UTF8, it loads fine.

    This problem comes about because of a combination of dumping a database with a multi-byte character encoding, and trying to restore it into a SQL_ASCII database. Using SQL_ASCII basically disables the transcoding of client data to server data and assumes one byte per character, leaving it to the clients to take responsibility for using the right character map. Since the dump file contains the stored string as UTF-8, that is four bytes, so a SQL_ASCII database sees that as four characters, and therefore regards it as violating the constraint. And it prints out the value, which my terminal then reassembles as three characters.

    0 讨论(0)
  • 2020-12-31 01:10

    It depends what value you used when you created the database. createdb -E UNICODE creates a Unicode DB that should also accept multibyte characters and count them as one character.

    You can use

    psql -l
    

    to see which encoding was used. This page has a table including information about how many bytes per character are used.

    0 讨论(0)
提交回复
热议问题