How to avoid junk/garbage characters when reading data in multiple languages?

甜味超标 2020-12-04 03:57

I am parsing RSS news feeds in over 10 different languages.

All the parsing is done in Java, and the data is stored in MySQL before my APIs written in PHP are r

1 Answer
  • 2020-12-04 04:32

    The Gujarati starts રેલવે, correct? And the Malayalam starts നേപ, correct? And the English should have included Bureau’s.

    This is the classic case of:

    • The bytes you have in the client are correctly encoded in utf8. (Bureau is encoded in the ASCII/latin1 subset of utf8, but ’ is not the ASCII apostrophe.)
    • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
    • The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
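The effect of the three conditions above can be reproduced outside the database. The following is a minimal Python sketch, not the asker's actual pipeline: it assumes that cp1252 is a close enough stand-in for MySQL's latin1, and the function name and sample strings are purely illustrative.

```python
# Sketch: correctly encoded UTF-8 bytes that get labeled as latin1.
# Assumption: Python's cp1252 codec approximates MySQL's latin1.
# The function name and the sample text are illustrative only.

def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those same bytes as cp1252."""
    utf8_bytes = text.encode("utf-8")
    # Note: cp1252 leaves a few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
    # undefined, so this call can raise UnicodeDecodeError for some inputs.
    return utf8_bytes.decode("cp1252")


gujarati = "\u0ab0\u0ac7\u0ab2"  # the three code points of રેલ
mojibake = misread_as_latin1(gujarati)

# Each of these Gujarati code points takes 3 bytes in UTF-8,
# so 3 characters turn into 9 latin1 "characters".
assert len(gujarati.encode("utf-8")) == 9
assert len(mojibake) == 9

# Plain ASCII is unaffected, which is why "Bureau" survives
# while the curly apostrophe after it does not.
assert misread_as_latin1("Bureau") == "Bureau"
assert misread_as_latin1("Bureau\u2019s") != "Bureau\u2019s"
```

Only the non-ASCII characters are damaged, which matches the symptom in the question: the English text looks mostly fine while the Gujarati and Malayalam text is unreadable.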

    The fix for the data is a "2-step ALTER".

    ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
    ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
    

    where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.

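    Unfortunately, if you have a lot of columns to fix, it will take a lot of ALTERs. Within a single table, though, you can (and should) handle all the affected columns in one pair of ALTERs: the first MODIFYs every column to VARBINARY, and the second MODIFYs every column to VARCHAR with CHARACTER SET utf8.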
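When many columns are involved, the pair of statements can be generated rather than typed by hand. The helper below is a hypothetical Python sketch: the table name, column names, lengths, and the function itself are assumptions for illustration, and it does no escaping, so it is only suitable for trusted, hand-written input.

```python
# Sketch: build one pair of ALTER statements covering several columns.
# Everything named here (helper, table, columns, lengths) is hypothetical.

def two_step_alter(table, columns):
    """Return (step1, step2) SQL strings for the 2-step ALTER.

    columns is a list of (name, length, extra) tuples, where extra holds
    whatever else was already on the column, e.g. "NOT NULL", or "".
    """
    def clause(name, col_type, extra):
        parts = [f"MODIFY COLUMN `{name}` {col_type}"]
        if extra:
            parts.append(extra)
        return " ".join(parts)

    # Step 1: every column becomes VARBINARY, which keeps the bytes as-is.
    step1 = ", ".join(
        clause(name, f"VARBINARY({length})", extra)
        for name, length, extra in columns
    )
    # Step 2: the same bytes are now declared to be utf8 text.
    step2 = ", ".join(
        clause(name, f"VARCHAR({length}) CHARACTER SET utf8", extra)
        for name, length, extra in columns
    )
    return (f"ALTER TABLE `{table}` {step1};", f"ALTER TABLE `{table}` {step2};")
```

For example, `two_step_alter("news", [("title", 255, "NOT NULL"), ("summary", 1000, "")])` returns two statements, each with one MODIFY COLUMN clause per column. Review the generated SQL and test it on a copy of the table before running it against real data.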

    The fix for the code is to establish utf8 as the connection character set; how to do that depends on the API used in PHP. The ALTERs take care of the column definition.

    Edit

    You have VARCHAR with the wrong CHARACTER SET. Hence, you see Mojibake like àª°à«‡àª² instead of રેલ. Most conversion techniques try to preserve àª°à«‡àª², but that is not what you need. Instead, the intermediate step to VARBINARY preserves the bits while discarding the old claim that those bits represent latin1-encoded characters. The second step again preserves the bits, but now declares that they represent utf8 characters.
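The difference between converting the characters and reinterpreting the bits can be sketched in a few lines of Python. This is an analogy rather than what MySQL does internally: cp1252 again stands in for latin1, and the function names are illustrative.

```python
# Sketch: "convert the characters" versus "keep the bits, change the label".
# Assumption: cp1252 approximates MySQL's latin1; names are illustrative.

original = "\u0ab0\u0ac7\u0ab2"          # the three code points of રેલ
stored_bytes = original.encode("utf-8")  # the bytes sitting in the column


def convert_characters(raw: bytes) -> bytes:
    """Read the bytes as latin1 text, then transcode that text to UTF-8.

    This preserves the Mojibake characters, so the result is double-encoded.
    """
    return raw.decode("cp1252").encode("utf-8")


def reinterpret_bytes(raw: bytes) -> str:
    """Leave the bytes untouched and read them as UTF-8.

    This mirrors going through VARBINARY and then declaring utf8.
    """
    return raw.decode("utf-8")


double_encoded = convert_characters(stored_bytes)

# Conversion changes the bytes and makes them longer: the damage is now baked in.
assert double_encoded != stored_bytes
assert len(double_encoded) > len(stored_bytes)

# Reinterpretation changes no bytes and recovers the intended text.
assert reinterpret_bytes(stored_bytes) == original
```

A direct character set conversion behaves like `convert_characters`, which is why the detour through VARBINARY is needed.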
