I am parsing rss news feeds from over 10 different languages.
All the parsing is being done in java and data is stored in MySQL before my API\'s written in php are r
The Gujarati starts રેલવે, correct? And the Malyalam starts നേപ, correct? And the English should have included Bureau’s.
This is the classic case of
Bureau is encoded in the Ascii/latin1 subset of utf8; but ’ is not the ascii apostrophe.)SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)The fix for the data is a "2-step ALTER".
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.
Unfortunately, if you have a lot of columns to work with, it will take a lot of ALTERs. You can (should) MODIFY all the necessary columns to VARBINARY for a single table in a pair of ALTERs.
The fix for the code is to establish utf8 as the connection; this depends on the api used in PHP. The ALTERs will change the column definition.
Edit
You have VARCHAR with the wrong CHARACTER SET. Hence, you see Mojibake like રેલ. Most conversion techniques try to preserve રેલ, but that is not what you need. Instead, taking a step to VARBINARY preserves the bits while ignoring the old definition of the bits representing latin1-encoded characters. The second step again preserves the bits, but now claiming they represent utf8 characters.