I\'ve got a database with a bunch of broken utf8 characters scattered across several tables. The list of characters isn\'t very extensive AFAIK (áéíúóÁÉÍÓÚÑñ)
Fixing
This query helped me identify rows that had bad characters in them. Basically you identify where the field isn't null, then convert to UTF8 and check for null after conversion.
select ach.*
from ach_warehouse ach
where addendum is not null and convert(addendum using utf8) is null;
No text replacement is a universal solutions because you can forget some character. A more suitable fix for double converted characters is:
Like this:
alter table descriptions modify name VARCHAR(2000) character set latin1;
alter table descriptions modify name blob;
alter table descriptions modify name VARCHAR(2000) character set utf8;
The middle step may not have been necessary since TEXT
and BLOB
are the same.
This had the effect of correcting all incorrectly encoded data but data which was previously correct was truncated at the first multibyte character.
Based on data in this post https://www.i18nqa.com/debug/utf8-debug.html I'd suggest this is a good query of identifying dodgy entries and possible correct values:
SELECT my_field,CONVERT(BINARY CONVERT(my_field USING latin1) USING utf8mb4) AS new_field_value FROM my_table WHERE my_field REGEXP '[âÆËÅÂÃ]';
Be very careful because we had a bad encoding of a file name, but an OK encoding of the path, and in that case some of the solutions above would have caused a world of pain. If some of your data is already correctly encoded in UTF8 you'll likely find you lost a chunk of it.
As the main question was on detecting broken chars my solution: (To prevent normal charset double encoding)
SELECT name FROM %table%
WHERE
CONVERT(CONVERT(name USING BINARY) USING utf8 ) != CONVERT(CONVERT(CONVERT(CONVERT(name USING BINARY) USING latin1) USING BINARY) USING utf8);
UPDATE %table% SET name = convert(cast(convert(name using latin1 ) as binary) using utf8 )
WHERE
CONVERT(CONVERT(name USING BINARY) USING utf8 ) != CONVERT(CONVERT(CONVERT(CONVERT(name USING BINARY) USING latin1) USING BINARY) USING utf8);
this also solved my problem on some italian chars
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'á','á');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ä','ä');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'é','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ó','ó');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ú','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ñ','ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í‘','Ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã','í');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'–','–');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'’','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'…','...');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'–','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'“','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‘','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'•','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‡','c');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Â','');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í ','à');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í¨','è');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íˆ','È');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'€','€');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'eÌ€','è');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í²','ò');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í¹','ù');