问题
I've been researching this question for several hours now, on SO, in MySQL docs, and elsewhere, but still can't find a satisfactory solution. The problem is:
What is the simplest way to make MySQL treat strings just like SQLite does, without any extra "smart" conversions?
For example, the following works perfectly in SQLite:
CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE);
INSERT INTO `dummy` (`key`) VALUES ('one');
INSERT INTO `dummy` (`key`) VALUES ('one ');
INSERT INTO `dummy` (`key`) VALUES ('One');
INSERT INTO `dummy` (`key`) VALUES ('öne');
SELECT * FROM `dummy`;
However, in MySQL, with the following settings:
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_bin
and the following CREATE DATABASE
statement:
CREATE DATABASE `dummydb` DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_bin;
it still fails on the second INSERT
.
I'd rather keep string column declarations as simple as possible, SQLite's TEXT
being the ideal. Looks like VARBINARY
is the way to go, but I would still like to hear your opinions on any other, potentially better options.
Addendum: The SHOW CREATE TABLE dummy
output is
mysql> SHOW CREATE TABLE dummy;
+-------+-----------------------------------------------------
| Table | Create Table
+-------+-----------------------------------------------------
| dummy | CREATE TABLE `dummy` (
`key` varchar(255) COLLATE utf8mb4_bin NOT NULL,
UNIQUE KEY `key` (`key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin |
+-------+-----------------------------------------------------
1 row in set (0.00 sec)
回答1:
MySQL wants to convert strings when doing INSERT
and SELECT
. The conversion is between what you declare the client to have and what the column is declared to be storing.
The only way to avoid that is with VARBINARY
and BLOB
instead of VARCHAR
and TEXT
.
The use of COLLATION utf8mb4_bin
does not avoid conversion to/from CHARACTER SET utf8mb4
; it merely says that WHERE
and ORDER BY
should compare the bits instead of dealing with accents and case folding.
Keep in mind that CHARACTER SET utf8mb4
is a way to encode text; COLLATION utf8mb4_*
is rules for comparing texts in that encoding. _bin
is simpleminded.
UNIQUE
involves comparing for equality, hence COLLATION
. In most utf8mb4 collations, the 3 (without spaces) will compare equal. utf8mb4_bin
will treat the 3 as different. utf8mb4_hungarian_ci
treats one=One>öne.
The trailing spaces are controlled by the datatype of the column (VARCHAR
or other). The latest version even has a setting relating to whether to consider trailing spaces.
回答2:
The approach shown in the question should (mostly) work just fine in MySQL for the following reasons:
Collation (not to be confused with encoding) is the set or rules that define how to sort and compare characters, typically used to replicate at database level the user expectations from a cultural perspective (if I search for
cafe
I expect to findcafé
as well).Collation plays an important rule on unique constraints because its establishes the definition of unique.
Binary collations are specifically meant to ignore cultural rules and work at byte level, thus
utf8mb4_bin
is the right choice here.MySQL allows to set a combination of encoding and collation with a column level granularity.
If a column definition is missing collation, it'll use the table level one.
If a table definition is missing collation, it'll use the database level one.
If a database definition is missing collation, it'll use the server level one.
It's also worth noting that MySQL will convert between encodings transparently as long as:
- Connection encoding is properly set
- Conversion is physically possible (e.g. all source characters also belong to target encoding)
For this last reason, VARBINARY
is possibly not the best choice for a column that's still text because it opens the door to getting café
stored from a connection configured to use ISO-8859-1 and not being able to retrieve it correctly from a connection configured to use UTF-8.
Side note: the table definition shown may trigger the following error:
ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
Indexes may have a relatively small maximum size. From docs:
If innodb_large_prefix is enabled (the default), the index key prefix limit is 3072 bytes for InnoDB tables that use DYNAMIC or COMPRESSED row format. If innodb_large_prefix is disabled, the index key prefix limit is 767 bytes for tables of any row format.
innodb_large_prefix is deprecated and will be removed in a future release. innodb_large_prefix was introduced in MySQL 5.5 to disable large index key prefixes for compatibility with earlier versions of InnoDB that do not support large index key prefixes.
The index key prefix length limit is 767 bytes for InnoDB tables that use the REDUNDANT or COMPACT row format. For example, you might hit this limit with a column prefix index of more than 255 characters on a TEXT or VARCHAR column, assuming a utf8mb3 character set and the maximum of 3 bytes for each character.
Attempting to use an index key prefix length that exceeds the limit returns an error. To avoid such errors in replication configurations, avoid enabling innodb_large_prefix on the master if it cannot also be enabled on slaves.
Since utf8_mb8 allocates 4 bytes per character, a 767 limit will be overflowed with only 192 characters.
We have one more problem:
mysql> CREATE TABLE `dummy` (
-> `key` varchar(191) COLLATE utf8mb4_bin NOT NULL,
-> UNIQUE KEY `key` (`key`)
-> )
-> ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('one');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('one ');
ERROR 1062 (23000): Duplicate entry 'one ' for key 'key'
Pardon?
mysql> INSERT INTO `dummy` (`key`) VALUES ('One');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('öne');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM `dummy`;
+-----+
| key |
+-----+
| One |
| one |
| öne |
+-----+
3 rows in set (0.00 sec)
This last issue is a interesting subtlety of MySQL collations. From docs:
All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the LIKE pattern-matching operator, for which trailing spaces are significant
[...] For those cases where trailing pad characters are stripped or comparisons ignore them, if a column has an index that requires unique values, inserting into the column values that differ only in number of trailing pad characters will result in a duplicate-key error.
I'd dare say then that VARBINARY
type is the only way to overcome this...
来源:https://stackoverflow.com/questions/42371220/how-to-make-mysql-handle-strings-like-sqlite-does-with-regard-to-unicode-and-co