MySQL: Converting datatypes and collations effect on stored data

╄→гoц情女王★ 提交于 2019-12-23 03:06:33

问题


I have a general question about this. There are many times we want to change data-types of fields or collations when lots of data is inserted before. Consider these situations :

  1. converting varchar collation from utf8_general_ci to latin1_swedish_ci: as I know the first has multibyte chars and the second singly byte ones. Does this conversion manipulate stored records correctly? And does this conversion lead to reduction of volume of existing data (maybe 50%)?

  2. Conversion of int(10) to smallint(5): Does the volume of data reduce to 50% correctly?

  3. Or for example: int(10) to unsigned int(10) - text to varchar(1000) - varchar(20) to char(10) , ...

As it is obvious, these actions might be done to increase efficiency, reduce volume of data and ...

Consider I have a table with 1,000,000 records. I want to know if doing such actions have bad effects on stored data, or if it makes low performance for future inserts and selects involving this table.

UPDATE :
When I talk about changing utf8 encoding charset to Latin, of course the values of my field are English (it's obvious if there are Japanese, they will be lost). With this assumption, I'm asking about the resulting table size and performance.


回答1:


  1. Converting varchar collation from utf8_general_ci to latin1_swedish_ci: As I know the first has multibyte chars and the second singly byte ones. Does this conversion manipulate stored records correctly? And does this conversion lead to reduction of volume of existing data (maybe 50%)?

    Collation is merely the ordering that is used for string comparisons—it has (almost) nothing to do with the character encoding that is used for data storage. I say almost because collations can only be used with certain character sets, so changing collation may force a change in the character encoding.

    To the extent that the character encoding is modified, MySQL will correctly re-encode values to the new character set whether going from single to multi-byte or vice-versa. Beware that any values that become too large for the column will be truncated.

    Provided that the new character type is of variable-length and that the values are encoded with fewer bytes in the new encoding than before, there will of course be a reduction in the table's size.

  2. Conversion of int(10) to smallint(5): Does the volume of data reduce to 50% correctly?

    INT and SMALLINT respectively occupy 4 and 2 bytes regardless of display width: so yes, the size of the table will reduce accordingly.

  3. Or for example: int(10) to unsigned int(10) - text to varchar(1000) - varchar(20) to char(10), ...

    • INT occupies 4 bytes irrespective of whether it is signed, so there will be no change;

    • TEXT and VARCHAR(1000) both occupy L+2 bytes (where L is the value's length in bytes), so there will be no change;

    • VARCHAR(20) occupies L+1 bytes (where L is the value's length in bytes) whereas CHAR(10) occupies 10×w bytes (where w is the number of bytes required for the maximum-length character in the character set), so there may well be a change but it is dependent on the actual values stored and the character encoding used.

Note that, depending on storage engine, reductions in table size may not immediately be released to the filesystem.




回答2:


A1. collation does not change your data. it changes sort order in your queries, and possibly changes your indices (?).

A2. length of the data in the column will be reduced, however you always have some overhead per table row, and you cannot change that. moreover, if your data is not unique, you will not see much reduction in index size, because your index looks like this: 33->{row1,row2,row3...},67->{row9,row0,row7} and every row pointer is much larger than an int.

in other words, if you had a table with a hundred int rows, without many indices, and changed all these columns to tinyint, you would see a notable improvement. if it is only one column, don't bother.

http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html http://dev.mysql.com/doc/refman/5.0/en/innodb-physical-record.html

A3. please read up on text vs varchar. the earlier stores data separately from table row, the latter in the row. each has own implications.

p.s. row and index overhead depends a lot on what db engine you use. normally you should use innodb. however for read-only tasks, e.g. data mining, myisam is more efficient.




回答3:


  • Converting varchar collation from utf8_general_ci to latin1_swedish_ci: It can reduce table(file) size, but you can lose not latin symbols, only english words will be stored correctly.
  • Conversion of int(10) to smallint(5) - it will reduce the volume of data. Conversion of int(10) to unsigned int(10) - it won't reduce. In these cases you should care about the values, you can get an error - out of range value.
  • Conversion varchar(20) to char(10): CHARs are used for strings that always have the same length (for example - 10), if the strings are different in length, then use VARCHAR data type.


来源:https://stackoverflow.com/questions/13950021/mysql-converting-datatypes-and-collations-effect-on-stored-data

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!