How to identify a language in a utf-8 column in MySQL

℡╲_俬逩灬. Submitted on 2020-01-23 18:28:29

Question


My question is: how do I find a specific character set in a utf-8 column in MySQL Server?

Please note that this is NOT a duplicate question. Please read carefully what is asked, not what you think is asked.

Currently MySQL works perfectly with utf-8 and displays all kinds of languages, and I have no problem seeing different languages in the database. I use SQLyog to connect to the MySQL server and all SELECT results are perfect: Cyrillic, Japanese, Chinese, Turkish, French, Italian, Arabic, and other languages are mixed together and display correctly. My my.ini and scripts are also configured correctly and work well.

In "How can I find non-ASCII characters in MySQL?" I see that people answered the question, and their answers work perfectly for finding non-ASCII text. My question is similar but a little different: I want to find a specific character set in a utf-8 column in MySQL Server.

Let's say,

select * from TABLE where COLUMN regexp '[^ -~]';

It returns all rows with non-ASCII characters, including Cyrillic, Japanese, Chinese, Turkish, French, Italian, Arabic, or any other language. But what I want is

SELECT * from TABLE WHERE COLUMN like or regexp'Japanese text only?'

In other words, I want to SELECT only Japanese-encoded text. Currently I can see all types of languages with this:

select * from TABLE where COLUMN regexp '[^ -~]';

But I want to select only Japanese, or Russian, or Arabic, or French text. How do I do that?

The database contains rows in all languages mixed together, all in UTF-8. I am not sure whether this is possible in MySQL Server. If it is not, how else can I do this?

Thanks a lot!


Answer 1:


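Well, let's start with a table I put in here. It says, for example, that E381yy is the utf8 encoding for Hiragana and E383yy is Katakana (Japanese). Both blocks also spill into E382yy: Hiragana continues through E3829F and Katakana begins at E382A0. (Kanji is another matter.)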

To see if a utf8 column contains Katakana, do something like

WHERE HEX(col) REGEXP '^(..)*E383'

Cyrillic might be

WHERE HEX(col) REGEXP '^(..)*D[0-4]'

Chinese is a bit tricky, but this might usually work for Chinese (and Kanji?):

WHERE HEX(col) REGEXP '^(..)*E[4-9A]'
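
To see why the ^(..)* prefix matters, here is a minimal Python sketch of what HEX(col) REGEXP '...' evaluates. It is not MySQL code. hex_regexp is a hypothetical helper name, and the sketch assumes the column holds UTF-8 bytes and that HEX() yields uppercase hex digits, two per byte.

```python
import re

def hex_regexp(text, pattern):
    """Mimic MySQL's HEX(col) REGEXP pattern for a UTF-8 string (illustrative helper)."""
    # Two uppercase hex digits per byte, like the output of HEX().
    hex_string = text.encode("utf-8").hex().upper()
    return re.search(pattern, hex_string) is not None

# Patterns quoted from the answer above.
KATAKANA = r"^(..)*E383"
CYRILLIC = r"^(..)*D[0-4]"
CJK = r"^(..)*E[4-9A]"

print(hex_regexp("ナ", KATAKANA))        # True: U+30CA encodes as E3 83 8A
print(hex_regexp("カ", KATAKANA))        # False: U+30AB encodes as E3 82 AB
print(hex_regexp("Привет", CYRILLIC))    # True: U+041F encodes as D0 9F
print(hex_regexp("中文", CJK))           # True: U+4E2D encodes as E4 B8 AD

# Without ^(..)* the match can straddle two bytes: "M0" is 4D 30,
# and "D3" appears at an odd offset inside "4D30".
print(hex_regexp("M0", r"D[0-4]"))       # True (false positive)
print(hex_regexp("M0", CYRILLIC))        # False
```

The (..)* group consumes whole bytes from the start of the string, so the lead-byte test can only begin on a byte boundary.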

(I'm going to change your Title to avoid the keyword 'character set'.)

Western Europe (including, but not limited to, French): C[23]; Turkish (approximately, and some others): (C4|C59); Greek: C[EF]; Hebrew: D[67]; Indian, etc.: E0; Arabic/Farsi/Persian/Urdu: D[89AB]. (Always prefix with ^(..)*.)

You may notice that these are not necessarily very specific. This is because of overlaps. British English and American English cannot be distinguished except by spelling of a few words. Several accented letters are shared in various ways in Europe. India has many different character sets: Devanagari, Bengali, Gurmukhi, Gujarati, etc.; these are probably distinguishable, but it would take more research. I think Arabic/Farsi/Persian/Urdu share one character set.
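
The overlap is easy to see with a rough Python sketch that runs every lead-byte pattern from the list above against one string. The dictionary labels and the matching_labels helper are made up for illustration; only the hex patterns come from this answer.

```python
import re

# Lead-byte patterns quoted from the list above; the labels are informal.
LEAD_BYTE_PATTERNS = {
    "Western Europe": "C[23]",
    "Turkish (approx)": "(C4|C59)",
    "Greek": "C[EF]",
    "Hebrew": "D[67]",
    "Indian, etc.": "E0",
    "Arabic/Farsi/Persian/Urdu": "D[89AB]",
}

def matching_labels(text):
    """Return every label whose byte-aligned pattern occurs in the UTF-8 hex of text."""
    hex_string = text.encode("utf-8").hex().upper()
    return [
        label
        for label, lead in LEAD_BYTE_PATTERNS.items()
        if re.search("^(..)*" + lead, hex_string)
    ]

print(matching_labels("café"))    # ['Western Europe']
print(matching_labels("şeker"))   # ['Turkish (approx)']
print(matching_labels("Türkçe"))  # ['Western Europe'], since ü and ç encode as C3xx
```

The last call shows the limitation: a Turkish word that happens to use only ü and ç is indistinguishable from French or German by lead byte alone.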

Some more:

| SCRIPT                        | FIRST (HEX)   | LAST (HEX)    |
| SAMARITAN                     | E0A080        | E0A0BE        |
| DEVANAGARI                    | E0A480        | E0A5BF        |
| BENGALI                       | E0A681        | E0A7BB        |
| GURMUKHI                      | E0A881        | E0A9B5        |
| GUJARATI                      | E0AA81        | E0ABB1        |
| ORIYA                         | E0AC81        | E0ADB1        |
| TAMIL                         | E0AE82        | E0AFBA        |
| TELUGU                        | E0B081        | E0B1BF        |
| KANNADA                       | E0B282        | E0B3B2        |
| MALAYALAM                     | E0B482        | E0B5BF        |
| SINHALA                       | E0B682        | E0B7B4        |
| THAI                          | E0B881        | E0B99B        |
| LAO                           | E0BA81        | E0BB9D        |
| TIBETAN                       | E0BC80        | E0BF94        |

So, for DEVANAGARI, '^(..)*E0A[45]'
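
If you need a script that is not in the table, the hex bounds can be derived from its Unicode code points. A small Python sketch follows; the helper names are hypothetical, and the code points (U+0900 to U+097F for Devanagari, U+0E01 to U+0E5B for Thai) are taken from the Unicode charts rather than from this answer.

```python
def utf8_hex(code_point):
    """Uppercase hex of the UTF-8 encoding of one code point, like HEX() in MySQL."""
    return chr(code_point).encode("utf-8").hex().upper()

def utf8_hex_range(first, last):
    """Hex encodings of the first and last code point of a range."""
    return utf8_hex(first), utf8_hex(last)

def common_hex_prefix(low, high):
    """Leading hex digits shared by both bounds; a starting point for a REGEXP."""
    prefix = []
    for a, b in zip(low, high):
        if a != b:
            break
        prefix.append(a)
    return "".join(prefix)

low, high = utf8_hex_range(0x0900, 0x097F)   # Devanagari
print(low, high)                             # E0A480 E0A5BF
print(common_hex_prefix(low, high))          # E0A
print(utf8_hex_range(0x0E01, 0x0E5B))        # ('E0B881', 'E0B99B'), the THAI row
```

The shared prefix E0A plus the first differing digit (4 to 5) gives the E0A[45] used above. Treat the prefix only as a hint, since neighbouring scripts can share it.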



Source: https://stackoverflow.com/questions/37063793/how-to-identify-a-language-in-utf-8-column-in-mysql
