Question
I have a general question about changing column definitions. We often want to change the data types or collations of fields after a lot of data has already been inserted. Consider these situations:
- Converting a `varchar` collation from `utf8_general_ci` to `latin1_swedish_ci`: as far as I know, the first uses multibyte characters and the second single-byte ones. Does this conversion handle the stored records correctly? And does it reduce the volume of the existing data (maybe by 50%)?
- Converting `int(10)` to `smallint(5)`: does the volume of data really shrink to 50%?
- Or, for example: `int(10)` to `unsigned int(10)`, `text` to `varchar(1000)`, `varchar(20)` to `char(10)`, ...
Obviously, such changes might be made to increase efficiency, reduce the volume of data, and so on.
Suppose I have a table with 1,000,000 records. I want to know whether such changes have bad effects on the stored data, or whether they degrade performance for future inserts and selects involving this table.
UPDATE:
When I talk about changing the character set from utf8 to Latin, of course the values in my field are English (obviously, if there were Japanese values, they would be lost). With this assumption, I'm asking about the resulting table size and performance.
Answer 1:
> Converting a `varchar` collation from `utf8_general_ci` to `latin1_swedish_ci`: as far as I know, the first uses multibyte characters and the second single-byte ones. Does this conversion handle the stored records correctly? And does it reduce the volume of the existing data (maybe by 50%)?

Collation is merely the ordering that is used for string comparisons; it has (almost) nothing to do with the character encoding that is used for data storage. I say almost because collations can only be used with certain character sets, so changing collation may force a change in the character encoding.
To the extent that the character encoding is modified, MySQL will correctly re-encode values to the new character set whether going from single to multi-byte or vice-versa. Beware that any values that become too large for the column will be truncated.
Provided that the new character type is of variable-length and that the values are encoded with fewer bytes in the new encoding than before, there will of course be a reduction in the table's size.
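To make the size argument concrete, here is a minimal Python sketch that compares byte lengths of a few sample strings in UTF-8 and Latin-1. It uses Python's own codecs as a stand-in for MySQL's `utf8` and `latin1` character sets, which is an assumption (MySQL's `latin1` is closer to cp1252, and how MySQL treats unrepresentable characters depends on the server's SQL mode). The helper name `encoded_sizes` and the sample values are hypothetical.

```python
def encoded_sizes(value: str) -> dict:
    """Return the byte length of `value` in UTF-8 and in Latin-1.

    The "latin1" entry is None when the value has no Latin-1 representation.
    """
    sizes = {"utf8": len(value.encode("utf-8"))}
    try:
        sizes["latin1"] = len(value.encode("latin-1"))
    except UnicodeEncodeError:
        # Python raises here; this only signals "not representable".
        sizes["latin1"] = None
    return sizes


# Hypothetical sample values: plain ASCII, accented Latin, Japanese.
for sample in ["hello", "caf\u00e9", "\u65e5\u672c\u8a9e"]:
    print(repr(sample), encoded_sizes(sample))
```

Pure ASCII text takes one byte per character in both encodings, so for the English-only values described in the question's update, a variable-length column would not be expected to shrink from the re-encoding alone.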
> Converting `int(10)` to `smallint(5)`: does the volume of data really shrink to 50%?

`INT` and `SMALLINT` occupy 4 and 2 bytes respectively, regardless of display width, so yes, the size of the table will reduce accordingly.

> Or, for example: `int(10)` to `unsigned int(10)`, `text` to `varchar(1000)`, `varchar(20)` to `char(10)`, ...

- `INT` occupies 4 bytes irrespective of whether it is signed, so there will be no change;
- `TEXT` and `VARCHAR(1000)` both occupy L+2 bytes (where L is the value's length in bytes), so there will be no change;
- `VARCHAR(20)` occupies L+1 bytes (where L is the value's length in bytes) whereas `CHAR(10)` occupies 10×w bytes (where w is the number of bytes required for the maximum-length character in the character set), so there may well be a change, but it depends on the actual values stored and the character encoding used.
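The formulas above can be sketched in Python as follows. This is an illustration of the arithmetic as stated in this answer, not a measurement of any real table: the function names are hypothetical, the 1-byte versus 2-byte length prefix follows the documented rule that the prefix is 1 byte when a column's values need at most 255 bytes, and actual on-disk storage also depends on the storage engine and row format.

```python
def varchar_bytes(value: str, max_chars: int, max_bytes_per_char: int,
                  encoding: str = "utf-8") -> int:
    """L + 1 or L + 2 bytes, where L is the encoded length of the value."""
    length = len(value.encode(encoding))
    # 1-byte length prefix if the column can never exceed 255 bytes, else 2.
    prefix = 1 if max_chars * max_bytes_per_char <= 255 else 2
    return length + prefix


def text_bytes(value: str, encoding: str = "utf-8") -> int:
    """L + 2 bytes for a TEXT value."""
    return len(value.encode(encoding)) + 2


def char_bytes(declared_chars: int, max_bytes_per_char: int) -> int:
    """declared_chars x w bytes, independent of the stored value."""
    return declared_chars * max_bytes_per_char


value = "hello"  # hypothetical stored value
print("VARCHAR(20), utf8:  ", varchar_bytes(value, 20, 3))
print("VARCHAR(1000), utf8:", varchar_bytes(value, 1000, 3))
print("TEXT, utf8:         ", text_bytes(value))
print("CHAR(10), utf8:     ", char_bytes(10, 3))
print("CHAR(10), latin1:   ", char_bytes(10, 1))
```

Under these assumptions a short value in a `CHAR(10)` column costs more than the same value in `VARCHAR(20)`, which is why the outcome depends on the data actually stored.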
Note that, depending on storage engine, reductions in table size may not immediately be released to the filesystem.
Answer 2:
A1. Collation does not change your data. It changes the sort order in your queries, and possibly your indices (?).
A2. The length of the data in the column will be reduced; however, you always have some overhead per table row, and you cannot change that. Moreover, if your data is not unique, you will not see much reduction in index size, because your index looks like this: 33->{row1,row2,row3...}, 67->{row9,row0,row7}, and every row pointer is much larger than an int.
In other words, if you had a table with a hundred int columns, without many indices, and changed all these columns to tinyint, you would see a notable improvement. If it is only one column, don't bother.
http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html http://dev.mysql.com/doc/refman/5.0/en/innodb-physical-record.html
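The point in A2 about fixed overhead can be illustrated with a rough Python calculation. The integer widths (4, 2, and 1 bytes) are the standard sizes; the `other_bytes` figures are purely hypothetical placeholders for everything else in a row (row header, other columns, index entries), and `row_saving_fraction` is a made-up helper name.

```python
import struct

# Standard fixed widths, matching MySQL's INT, SMALLINT and TINYINT.
INT_BYTES = struct.calcsize("<i")
SMALLINT_BYTES = struct.calcsize("<h")
TINYINT_BYTES = struct.calcsize("<b")


def row_saving_fraction(columns: int, old_bytes: int, new_bytes: int,
                        other_bytes: int) -> float:
    """Fraction of the row size saved by narrowing `columns` columns.

    `other_bytes` is a hypothetical figure for everything else in the row.
    """
    old_row = other_bytes + columns * old_bytes
    new_row = other_bytes + columns * new_bytes
    return (old_row - new_row) / old_row


# A hundred INT columns narrowed to TINYINT, small assumed overhead.
print(row_saving_fraction(100, INT_BYTES, TINYINT_BYTES, other_bytes=20))
# A single INT column narrowed in an otherwise wide row.
print(row_saving_fraction(1, INT_BYTES, TINYINT_BYTES, other_bytes=100))
```

With these assumed numbers the first case saves most of the row while the second saves only a few percent, which matches the "if it is only one column, don't bother" advice.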
A3. Please read up on text vs varchar. The former stores data separately from the table row, the latter in the row. Each has its own implications.
P.S. Row and index overhead depends a lot on which DB engine you use. Normally you should use InnoDB; however, for read-only tasks, e.g. data mining, MyISAM is more efficient.
Answer 3:
- Converting a `varchar` collation from `utf8_general_ci` to `latin1_swedish_ci`: it can reduce the table (file) size, but you can lose non-Latin symbols; only English words will be stored correctly.
- Converting `int(10)` to `smallint(5)`: it will reduce the volume of data. Converting `int(10)` to `unsigned int(10)`: it won't. In these cases you should take care with the values, because you can get an out-of-range value error.
- Converting `varchar(20)` to `char(10)`: CHARs are used for strings that always have the same length (for example, 10); if the strings differ in length, use the VARCHAR data type.
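The out-of-range concern for the integer conversions can be checked ahead of time. Below is a minimal Python sketch that tests candidate values against the documented ranges of the target types; the helper name `out_of_range` and the sample values are hypothetical, and in practice the values would come from the column itself.

```python
# Documented value ranges for the MySQL integer types discussed here.
RANGES = {
    "SMALLINT": (-32768, 32767),
    "SMALLINT UNSIGNED": (0, 65535),
    "INT": (-2147483648, 2147483647),
    "INT UNSIGNED": (0, 4294967295),
}


def out_of_range(values, target_type: str) -> list:
    """Return the values that would not fit into `target_type`."""
    low, high = RANGES[target_type]
    return [v for v in values if not low <= v <= high]


existing = [-5, 100, 40000, 70000]  # hypothetical contents of an INT column
print("Too large for SMALLINT:", out_of_range(existing, "SMALLINT"))
print("Negative for INT UNSIGNED:", out_of_range(existing, "INT UNSIGNED"))
```

An empty result for the target type suggests the narrowing is safe for the current data, although it says nothing about values inserted later.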
Source: https://stackoverflow.com/questions/13950021/mysql-converting-datatypes-and-collations-effect-on-stored-data