How to detect UTF-8 characters in a Latin1 encoded column - MySQL


天命终不由人

I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.

At this point I simply want to check what sort of data I ha


4 Answers

不思量自难忘°

2020-11-28 04:26
Since your question is not completely clear, let's assume some scenarios:
1. Hitherto wrong connection: You've been connecting to your database incorrectly using the latin1 encoding, but have stored UTF-8 data in the database (the encoding of the column is irrelevant in this case). This is the case I described here. In this case, it's easy to fix: Dump the database contents to a file through a latin1 connection. This undoes the mistaken conversion, so the dump file ends up containing correct UTF-8, the same way your reads have worked so far (read the aforelinked article for the gory details). You can then reimport the data into the database through a correctly set utf8 connection, and it will be stored as it should be.
2. Hitherto wrong column encoding: UTF-8 data was inserted into a latin1 column through a utf8 connection. In that case forget it, the data is gone. Characters that exist in latin1 were converted on insert and survive, but any non-latin1 character will have been replaced by a ?.
3. Hitherto everything fine, henceforth added support for UTF-8: You have Latin-1 data correctly stored in a latin1 column, inserted through a latin1 connection, but want to expand that to also allow UTF-8 data. In that case just change the column encoding to utf8. MySQL will convert the existing data for you. Then just make sure your database connection is set to utf8 when you insert UTF-8 data.
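
To make scenarios 1 and 2 concrete, here is a minimal Python sketch of what happens to the bytes. It does not talk to MySQL at all; it only imitates the conversions, and it assumes Python's `latin-1` codec is a close enough stand-in for MySQL's `latin1` (which is really cp1252 and differs for bytes 0x80 to 0x9F). The function names are made up for illustration.

```python
def misread_as_latin1(text: str) -> str:
    """Scenario 1: UTF-8 bytes are stored, but treated as latin1 text."""
    stored_bytes = text.encode("utf-8")    # what the application really sent
    return stored_bytes.decode("latin-1")  # how a latin1 reader sees those bytes


def recover_from_misread(garbled: str) -> str:
    """Undo scenario 1: go back to the raw bytes, then read them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


def store_in_latin1_column(text: str) -> bytes:
    """Scenario 2: text is converted to latin1; unmappable characters become '?'."""
    return text.encode("latin-1", errors="replace")


garbled = misread_as_latin1("Björn")        # mojibake, but no information lost
restored = recover_from_misread(garbled)    # round trip gives the original back
lossy = store_in_latin1_column("ö and Ω")   # ö fits in latin1, Ω does not
```

The round trip in scenario 1 works because every byte maps to exactly one latin1 character and back. In scenario 2 the `?` is a real question mark byte, so nothing remains to recover from.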
北恋

2020-11-28 04:26

I would create a dump of the database and grep for all valid UTF-8 sequences. Where to take it from there depends on what you get. There are multiple questions on Stack Overflow about identifying invalid UTF-8; you can basically just reverse the logic.

Edit: So basically, any field consisting entirely of 7-bit ASCII is safe, and any field containing an invalid UTF-8 sequence can be assumed to be Latin-1. The remaining data should be inspected. If you are lucky, a handful of obvious substitutions will fix the vast majority (replace Ã¶ with Latin-1 ö, etc).
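
The three-way split described above could be sketched roughly like this in Python, working on raw bytes taken from a dump. The function name and the category labels are hypothetical, and how you get the raw bytes of each field out of your dump is left open.

```python
def classify_field(raw: bytes) -> str:
    """Sort one field's raw bytes into the three groups described above."""
    if raw.isascii():
        # Pure 7-bit ASCII is identical in Latin-1 and UTF-8: safe either way.
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # High bytes that do not form valid UTF-8: assume genuine Latin-1.
        return "latin1"
    # Valid UTF-8 with high bytes: probably UTF-8, but inspect by hand,
    # since some Latin-1 byte pairs happen to be valid UTF-8 too.
    return "utf8_candidate"


# Example inputs: the same name stored in three different ways.
samples = {
    "plain": b"Bjorn",
    "latin1": "Björn".encode("latin-1"),
    "utf8": "Björn".encode("utf-8"),
}
results = {label: classify_field(raw) for label, raw in samples.items()}
```

Only the `utf8_candidate` group needs a human look; the other two groups can be handled mechanically.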

遇见更好的自我

2020-11-28 04:33
Character encoding, like time zones, is a constant source of problems.

What you can do is look for any "high-ASCII" bytes (0x80 to 0xFF), as these are either LATIN1 accented characters or symbols, or part of a multi-byte UTF-8 character. Telling the difference isn't going to be easy unless you cheat a bit.

To figure out what encoding is correct, you just SELECT two different versions and compare visually. Here's an example:
```
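-- Show the same stored bytes interpreted as latin1 and as utf8, side by side.
SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1,
       CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8
FROM users
-- Only rows containing at least one byte in the range 0x80-0xFF.
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']');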
```
This is unusually complicated because the MySQL regexp engine seems to ignore escapes like \x80, which makes it necessary to build the byte range with UNHEX() instead.

This produces results like this:
```
latin1                utf8
----------------------------------------
BjÃ¶rn                Björn
```
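
If you would rather run the same comparison against a dump file than inside MySQL, a rough Python equivalent might look like this. It is only a sketch: `both_readings` is a made-up name, and Python's `latin-1` codec is assumed to be close enough to MySQL's `latin1` for eyeballing results.

```python
import re

# Same idea as the RLIKE above: any byte in the range 0x80-0xFF.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")


def both_readings(raw: bytes):
    """Return (latin1 reading, utf8 reading) for fields with high bytes, else None."""
    if not HIGH_BYTE.search(raw):
        return None  # pure ASCII, nothing to compare
    as_latin1 = raw.decode("latin-1")
    # Invalid UTF-8 is shown as U+FFFD instead of raising, so bad rows stand out.
    as_utf8 = raw.decode("utf-8", errors="replace")
    return as_latin1, as_utf8


utf8_row = both_readings(b"Bj\xc3\xb6rn")  # bytes that are really UTF-8
latin1_row = both_readings(b"Bj\xf6rn")    # bytes that are really Latin-1
ascii_row = both_readings(b"Bob")          # no high bytes at all
```

Whichever of the two readings looks like a real name tells you how that row was actually encoded.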
悲哀的现实

2020-11-28 04:36

There is a script on GitHub to help with this sort of thing.
