How to detect Chinese characters in MySQL?

巧了我就是萌 submitted on 2020-04-17 04:16:05

Question


I need to count the values that contain Chinese text in a list of columns. For example, "北京实业" is four Chinese characters, but it should be counted only once, because what matters is that it occurs in the column.

Is there any specific code to figure this out?


Answer 1:


SELECT COUNT(*)
    FROM tbl
    WHERE HEX(col) REGEXP '^(..)*(E[2-9F]|F0A)'

will count the number of records with Chinese characters in column col.

Problems:

  • I am not sure what ranges of hex represent Chinese.
  • The test may also match Korean and Japanese text ("CJK").
  • In MySQL, 4-byte Chinese characters need utf8mb4 instead of utf8.

Elaboration

I am assuming the column in the table is CHARACTER SET utf8. In utf8 encoding, Chinese characters begin with a byte between hex E2 and E9, or EF, or F0. Those starting with hex E will be 3 bytes long, but I am not checking the length; the F0 ones will be 4 bytes.

The regexp starts with ^(..)*, meaning "from the start of the string (^), locate 0 or more (*) 2-character (..) values". Because HEX() produces two hex digits per byte, this keeps the match aligned to byte boundaries. After that should be either E-something or F0A. After that, anything can occur. The E-something is, more specifically, E followed by any of 2,3,4,5,6,7,8,9, or F.
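The effect of the ^(..)* prefix can be checked outside the database. The following is a minimal sketch in Python rather than SQL, assuming that HEX(col) on a utf8 column yields the uppercase hex of the UTF-8 bytes; the helper names hex_utf8 and looks_chinese are made up for illustration, and Python's re module stands in for MySQL's REGEXP.

```python
import re

# Same pattern as in the query above.
ALIGNED = re.compile(r'^(..)*(E[2-9F]|F0A)')
# The same alternatives without the byte-alignment prefix, for comparison.
UNALIGNED = re.compile(r'E[2-9F]|F0A')


def hex_utf8(text):
    """Approximate MySQL's HEX(col): uppercase hex of the UTF-8 bytes."""
    return text.encode('utf-8').hex().upper()


def looks_chinese(text):
    """True if a byte starting on a 2-digit boundary matches the pattern."""
    return ALIGNED.search(hex_utf8(text)) is not None


print(looks_chinese('北京实业'))   # True: the first byte is E5
print(looks_chinese('Beijing'))    # False: no byte starts with E or F0A
# 'ng' is 6E 67, so the hex digits contain "E6" straddling two bytes.
print(UNALIGNED.search(hex_utf8('Beijing')) is not None)  # True (false positive)
```

Without the prefix, plain ASCII such as "Beijing" would match, because the second digit of one byte and the first digit of the next can look like a lead byte.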

Picked at random, I see that 草 encodes as the 3 hex bytes E88D89, and 𠜎 encodes as the 4 hex bytes F0A09C8E.
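These two byte sequences can be decoded to confirm their lengths and code points. This is a small Python sketch, not part of the original answer; the function name describe is hypothetical.

```python
def describe(hex_bytes):
    """Decode a hex string as UTF-8; return (character, code point, byte count)."""
    raw = bytes.fromhex(hex_bytes)
    ch = raw.decode('utf-8')  # each sample decodes to exactly one character
    return ch, 'U+{:04X}'.format(ord(ch)), len(raw)


print(describe('E88D89'))    # 3-byte sequence, fits in MySQL's utf8
print(describe('F0A09C8E'))  # 4-byte sequence, needs utf8mb4 in MySQL
```

The second character lies outside the Basic Multilingual Plane, which is why it takes 4 bytes.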

I do not know of a better way to check a string for a specific language.

As you found, the REGEXP can be rather slow.

This regexp may be over-inclusive, in that some non-Chinese characters may also match.
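To see which kinds of text are caught unintentionally, the lead-byte test can be compared with a stricter code-point test. The sketch below is in Python, not SQL, and rests on an assumption: "Chinese" is taken to mean the Unicode blocks CJK Unified Ideographs (U+4E00 to U+9FFF), Extension A (U+3400 to U+4DBF) and Extension B (U+20000 to U+2A6DF). Other blocks exist, and Japanese kanji share these ranges, so this is an illustration rather than a definitive language test. The names hex_regexp_match and has_han are hypothetical.

```python
import re

PATTERN = re.compile(r'^(..)*(E[2-9F]|F0A)')

# Assumed definition of "Chinese" for this sketch: three Unicode blocks.
HAN_RANGES = ((0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0x20000, 0x2A6DF))


def hex_regexp_match(text):
    """Emulate HEX(col) REGEXP '^(..)*(E[2-9F]|F0A)' on UTF-8 data."""
    return PATTERN.search(text.encode('utf-8').hex().upper()) is not None


def has_han(text):
    """True if any character falls inside one of the assumed Han ranges."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in HAN_RANGES)


for sample in ('北京实业', '€100', 'こんにちは', 'Beijing'):
    print(sample, hex_regexp_match(sample), has_han(sample))
```

The euro sign (E2 82 AC) and Japanese hiragana (lead byte E3) pass the hex test even though they contain no Han ideographs.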



Source: https://stackoverflow.com/questions/35061775/how-to-detect-chinese-character-in-mysql
