How to avoid junk/garbage characters while reading data in multiple languages?


甜味超标

I am parsing RSS news feeds in over 10 different languages.

All the parsing is being done in Java and data is stored in MySQL before my APIs written in PHP are r

1 answer

悲&欢浪女

2020-12-04 04:32
The Gujarati starts રેલવે, correct? And the Malayalam starts നേപ, correct? And the English should have included Bureau’s.

This is the classic case of:
- The bytes you have in the client are correctly encoded in utf8. (Bureau consists of ASCII characters, which are encoded identically in latin1 and utf8; but ’ is not the ASCII apostrophe.)
- You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
- The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
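
The three conditions above can be reproduced outside MySQL. Here is a minimal Python sketch, assuming that MySQL's latin1 behaves like Windows-1252 (Python's `cp1252` codec) for these particular bytes; the variable names are illustrative only.

```python
# UTF-8 bytes, exactly as a correctly behaving client would send them.
original = "રેલ"
utf8_bytes = original.encode("utf-8")

# Misreading those same bytes as latin1 (cp1252 here) produces the Mojibake.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)

# The curly apostrophe in "Bureau’s" is mangled the same way...
apostrophe = "’".encode("utf-8").decode("cp1252")
print(apostrophe)

# ...while plain ASCII letters survive, because their bytes are the same
# in ASCII, latin1 and utf8.
ascii_word = "Bureau".encode("utf-8").decode("cp1252")
print(ascii_word)
```

This is why English text mostly looks fine while Gujarati and Malayalam text looks completely scrambled.
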
The fix for the data is a "2-step ALTER".
```
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
```
where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.

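Unfortunately, if you have many columns to fix, that means many MODIFY clauses. For any single table, though, you can (and should) cover all the necessary columns in one pair of ALTERs: the first changes every affected column to VARBINARY, the second changes them all to VARCHAR with CHARACTER SET utf8.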
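
If there are many tables, generating the statement pairs may be less error-prone than typing them. The following Python sketch only builds the SQL text; the helper name `two_step_alters`, the table `news`, and its columns are hypothetical, and it assumes identifiers come from a trusted source (no quoting or escaping is done).

```python
def two_step_alters(table, columns, charset="utf8"):
    """Build one pair of ALTER statements covering several columns.

    `columns` is a list of (name, length, extra) tuples, where `extra` is
    whatever else was already on the column (e.g. "NOT NULL"), or "".
    """
    def clause(name, col_type, extra):
        # Skip the trailing attribute when the column has none.
        parts = ("MODIFY COLUMN", name, col_type, extra)
        return " ".join(part for part in parts if part)

    step1 = ", ".join(
        clause(name, f"VARBINARY({length})", extra)
        for name, length, extra in columns
    )
    step2 = ", ".join(
        clause(name, f"VARCHAR({length}) CHARACTER SET {charset}", extra)
        for name, length, extra in columns
    )
    return [f"ALTER TABLE {table} {step1};", f"ALTER TABLE {table} {step2};"]


# Hypothetical table and column definitions, for illustration only.
statements = two_step_alters(
    "news",
    [("title", 255, "NOT NULL"), ("summary", 2000, "")],
)
for statement in statements:
    print(statement)
```

Review the generated statements, and take a backup, before running them against real data.
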

The fix for the code is to establish utf8 as the connection character set; how you do that depends on the API used in PHP. The ALTERs, for their part, fix the column definitions.

Edit

You have VARCHAR with the wrong CHARACTER SET. Hence, you see Mojibake like àª°à«‡àª². Most conversion techniques try to preserve the characters àª°à«‡àª², but that is not what you need. Instead, the step to VARBINARY preserves the bits and discards the old claim that those bits represent latin1-encoded characters. The second step again preserves the bits, but now declares that they represent utf8 characters.
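
The difference between preserving characters and preserving bits can be sketched in Python. This is an analogy rather than what the server literally executes, and it again assumes `cp1252` as a stand-in for MySQL's latin1.

```python
# The bytes sitting in the table: valid utf8, but labelled latin1.
stored_bytes = "રેલ".encode("utf-8")

# What a latin1 column claims those bytes mean.
as_latin1_chars = stored_bytes.decode("cp1252")

# Character-preserving conversion: keeps the Mojibake characters and
# re-encodes them as utf8, so the garbage is now stored "correctly".
char_preserving = as_latin1_chars.encode("utf-8")

# Bit-preserving relabel (the VARBINARY detour): the bytes are untouched
# and are simply read as utf8 from now on.
bit_preserving = stored_bytes.decode("utf-8")

print(char_preserving.decode("utf-8"))  # still Mojibake
print(bit_preserving)                   # the original Gujarati text
```
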