Question
My question is: how can I find text in a specific character set (script) in a utf-8 column in MySQL Server?
Please note that this is NOT a duplicate question; please read carefully what is being asked, not what you think is being asked.
Currently MySQL works perfectly with utf-8 and I have no problem viewing different languages in the database. I use SQLyog to connect to the MySQL server and all SELECT results are correct: Cyrillic, Japanese, Chinese, Turkish, French, Italian, Arabic and other languages are mixed together and all display properly. my.ini and my scripts are also configured correctly and working well.
In "How can I find non-ASCII characters in MySQL?" several people answered, and their answers work well for finding non-ASCII text. My question is similar but slightly different: I want to find a specific character set in a utf-8 column in MySQL Server.
For example,
select * from TABLE where COLUMN regexp '[^ -~]';
This returns every row containing non-ASCII characters, including Cyrillic, Japanese, Chinese, Turkish, French, Italian, Arabic or any other language. What I want instead is something like
SELECT * from TABLE WHERE COLUMN like or regexp'Japanese text only?'
In other words, I want to SELECT only Japanese text. Currently I can see all languages with this:
select * from TABLE where COLUMN regexp '[^ -~]';
But I want to select only Japanese, or Russian, or Arabic, or French text. How can I do that?
The database is UTF-8 and contains rows in all languages mixed together. I am not sure whether this is possible in MySQL Server. If it is not, how else can it be done?
Thanks a lot!
Answer 1:
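Well, let's start with a table I put in here. It says, for example, that E381yy is the utf8 encoding for Hiragana and E383yy is Katakana (Japanese). (Both blocks also spill into E382yy: the end of Hiragana and the start of Katakana are encoded there. Kanji is another matter.)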
To see if a utf8 column contains Katakana, do something like
WHERE HEX(col) REGEXP '^(..)*E383'
Cyrillic might be
WHERE HEX(col) REGEXP '^(..)*D[0-4]'
Chinese is a bit tricky, but this might usually work for Chinese (and Kanji?):
WHERE HEX(col) REGEXP '^(..)*E[4-9A]'
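To try these patterns without a database, here is a minimal Python sketch that imitates the `HEX(col) REGEXP ...` test. The helper name `hex_regexp` and the sample strings are made up for illustration, and Python's `re` is only a stand-in for MySQL's regex engine, so treat it as an approximation of the queries above rather than a replacement for them.

```python
import re

def hex_regexp(text: str, pattern: str) -> bool:
    """Imitate MySQL's HEX(col) REGEXP pattern for a UTF-8 string (illustrative helper)."""
    # HEX() on a utf8 column yields the uppercase hex of the stored bytes.
    hex_string = text.encode("utf-8").hex().upper()
    # REGEXP is an unanchored search, so re.search is the closest equivalent.
    return re.search(pattern, hex_string) is not None

# The three patterns from the answer.
KATAKANA = r"^(..)*E383"
CYRILLIC = r"^(..)*D[0-4]"
CJK = r"^(..)*E[4-9A]"

if __name__ == "__main__":
    for sample in ["カタカナ", "Привет", "中文", "plain ASCII"]:
        print(sample,
              hex_regexp(sample, KATAKANA),
              hex_regexp(sample, CYRILLIC),
              hex_regexp(sample, CJK))
```

One thing the sketch makes visible: Katakana code points U+30A0 to U+30BF encode as E382.., so a string made only of those characters (for example a lone カ) does not match the E383 pattern. Adding E382 would catch them, at the cost of also matching the tail of the Hiragana block.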
(I'm going to change your Title to avoid the keyword 'character set'.)
Western Europe (including, but not limited to, French): C[23]
Turkish (approximate, and some others): (C4|C59)
Greek: C[EF]
Hebrew: D[67]
Indian, etc.: E0
Arabic/Farsi/Persian/Urdu: D[89AB]
(Always prefix with ^(..)* so that the match stays aligned to whole bytes in the hex string.)
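The following Python sketch applies the prefixes listed above to a few sample words and shows why the `^(..)*` prefix matters. The dictionary keys, the helper names `utf8_hex` and `matching_groups`, and the sample words are assumptions chosen for illustration; the hex fragments themselves are the ones from the answer.

```python
import re

# Lead-byte fragments from the answer; the group names are illustrative labels.
PREFIXES = {
    "western_europe": "C[23]",
    "turkish": "(C4|C59)",
    "greek": "C[EF]",
    "hebrew": "D[67]",
    "indian": "E0",
    "arabic": "D[89AB]",
}

def utf8_hex(text: str) -> str:
    """Uppercase hex of the UTF-8 bytes, like MySQL's HEX() on a utf8 column."""
    return text.encode("utf-8").hex().upper()

def matching_groups(text: str) -> list:
    """Names of all groups whose byte-aligned pattern matches the text."""
    hex_string = utf8_hex(text)
    return sorted(
        name
        for name, fragment in PREFIXES.items()
        if re.search(r"^(..)*" + fragment, hex_string)
    )

if __name__ == "__main__":
    for sample in ["café", "ışık", "αβγ", "Ma"]:
        print(sample, utf8_hex(sample), matching_groups(sample))
    # Without the alignment prefix, "Ma" (hex 4D61) contains "D6" across a
    # byte boundary and would be misreported as Hebrew.
    print(re.search("D[67]", utf8_hex("Ma")) is not None)
```

The last line is the point of the `^(..)*` prefix: it consumes the hex string two digits at a time, so a fragment can only match where a byte actually begins.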
You may notice that these are not necessarily very specific. This is because of overlaps. British English and American English cannot be distinguished except by spelling of a few words. Several accented letters are shared in various ways in Europe. India has many different character sets: Devanagari, Bengali, Gurmukhi, Gujarati, etc.; these are probably distinguishable, but it would take more research. I think Arabic/Farsi/Persian/Urdu share one character set.
Some more (script, then the first and last utf8 encodings in hex):
| SAMARITAN | E0A080 | E0A0BE |
| DEVANAGARI | E0A480 | E0A5BF |
| BENGALI | E0A681 | E0A7BB |
| GURMUKHI | E0A881 | E0A9B5 |
| GUJARATI | E0AA81 | E0ABB1 |
| ORIYA | E0AC81 | E0ADB1 |
| TAMIL | E0AE82 | E0AFBA |
| TELUGU | E0B081 | E0B1BF |
| KANNADA | E0B282 | E0B3B2 |
| MALAYALAM | E0B482 | E0B5BF |
| SINHALA | E0B682 | E0B7B4 |
| THAI | E0B881 | E0B99B |
| LAO | E0BA81 | E0BB9D |
| TIBETAN | E0BC80 | E0BF94 |
So, for DEVANAGARI, '^(..)*E0A[45]'
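As a rough check on how a pattern like this falls out of the table, the Python sketch below enumerates the leading two bytes used by a range of code points. The function name `lead_prefixes` is hypothetical, and the Devanagari range U+0900 to U+097F is taken from the Unicode block definition, not from the answer itself.

```python
import re

def lead_prefixes(first_cp: int, last_cp: int) -> list:
    """Sorted distinct hex prefixes (first two UTF-8 bytes) used in a code point range.

    Assumes the range lies in the three-byte UTF-8 region (U+0800 to U+FFFF,
    excluding surrogates), which holds for the scripts in the table.
    """
    prefixes = set()
    for cp in range(first_cp, last_cp + 1):
        encoded = chr(cp).encode("utf-8")
        prefixes.add(encoded[:2].hex().upper())
    return sorted(prefixes)

# Unicode block bounds (assumption: standard Unicode Devanagari block).
DEVANAGARI = (0x0900, 0x097F)

if __name__ == "__main__":
    # First and last encodings, for comparison with the table row.
    print(chr(DEVANAGARI[0]).encode("utf-8").hex().upper())
    print(chr(DEVANAGARI[1]).encode("utf-8").hex().upper())
    # The distinct prefixes collapse into the character class E0A[45].
    print(lead_prefixes(*DEVANAGARI))
    # Every code point in the block matches the byte-aligned pattern.
    pattern = re.compile(r"^(..)*E0A[45]")
    print(all(
        pattern.search(chr(cp).encode("utf-8").hex().upper())
        for cp in range(DEVANAGARI[0], DEVANAGARI[1] + 1)
    ))
```

The same helper can be pointed at any other row of the table to see which two-byte prefixes a script occupies before writing its pattern.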
Source: https://stackoverflow.com/questions/37063793/how-to-identify-a-language-in-utf-8-column-in-mysql