Question
I used Chinese characters in Cassandra, and the data appears to be stored correctly, as shown below:
SELECT * FROM user;
user_id | user_name | user_phone
---------+--------------+-------------
23 | uSer23, | 12345678910
5 | uSer5^ | 12345678910
28 | uSer28名 | 12345678910
10 | uSer10- | 12345678910
16 | uSer16{ | 12345678910
13 | uSer13= | 12345678910
30 | uSer30一些 | 12345678910
11 | uSer11_ | 12345678910
1 | uSer1@ | 12345678910
19 | uSer19" | 12345678910
8 | uSer8( | 12345678910
0 | uSer0! | 12345678910
2 | uSer2# | 12345678910
4 | uSer4% | 12345678910
18 | uSer18[ | 12345678910
15 | uSer15} | 12345678910
22 | uSer22< | 12345678910
27 | uSer27/ | 12345678910
20 | uSer20: | 12345678910
7 | uSer7* | 12345678910
6 | uSer6& | 12345678910
29 | uSer29称 | 12345678910
9 | uSer9) | 12345678910
14 | uSer14| | 12345678910
26 | uSer26? | 12345678910
21 | uSer21; | 12345678910
17 | uSer17] | 12345678910
31 | uSer31区中文 | 12345678910
24 | uSer24> | 12345678910
25 | uSer25. | 12345678910
12 | uSer12+ | 12345678910
3 | uSer3$ | 12345678910
I created an index on the 'user_name' column as follows:
CREATE CUSTOM INDEX user_nontoken_idx ON QCS.user (user_name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'CONTAINS', 'analyzer_class':
'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'};
When I search using one of those Chinese characters, the query succeeds:
SELECT * FROM user WHERE user_name LIKE '%称%';
How does this actually work? How is Cassandra able to store Chinese text?
Answer 1:
By default, text is stored in Cassandra as UTF-8, as was mentioned in the comments.
As for your question, the main work is done by SASI, which takes the data from the text column and applies the analyzer to it. In most cases, the analyzer treats Chinese characters like any other characters. If you plan to index columns holding longer text, you may want to look at StandardAnalyzer. But for user names or similar values, NonTokenizingAnalyzer is probably the better choice.
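To illustrate why the Chinese characters need no special handling, here is a rough Python sketch of the idea. It is a conceptual model only, not SASI's actual implementation: the helper names `normalize` and `contains_match` are made up for this example, and the assumption is simply that a non-tokenizing, case-insensitive CONTAINS match behaves like a substring test on the lowercased value.

```python
# Conceptual model only: NOT how SASI is implemented internally.
# normalize() and contains_match() are hypothetical helpers for illustration.

def normalize(value: str, case_sensitive: bool = False) -> str:
    # With 'case_sensitive': 'false', the whole value is treated as one
    # lowercased term. Chinese characters have no case, so they pass through.
    return value if case_sensitive else value.lower()

def contains_match(stored: str, needle: str, case_sensitive: bool = False) -> bool:
    # LIKE '%needle%' modeled as a plain substring test.
    return normalize(needle, case_sensitive) in normalize(stored, case_sensitive)

# A few user_name values taken from the table in the question.
names = ["uSer28名", "uSer29称", "uSer30一些", "uSer31区中文", "uSer3$"]

matches = [n for n in names if contains_match(n, "称")]
print(matches)
```

In this model, '称' is just another character in the string, so the match works the same way as it would for '$' or '@'.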
Answer 2:
The ability to handle language-specific strings comes from the fact that the "TEXT" data type (the type of the "user_name" column here) is a "UTF-8 encoded string" in Cassandra. By contrast, if the "user_name" column had been declared as "ascii", it would accept only US-ASCII character strings.
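The difference between the two encodings can be seen outside Cassandra too. The following Python sketch only demonstrates the encodings themselves, not Cassandra's own validation code; the variable names are made up for the example.

```python
# Demonstrates UTF-8 vs US-ASCII encoding of one value from the question.
# This is plain Python string encoding, not Cassandra's validation logic.

name = "uSer29称"

# UTF-8 can represent every character: the six ASCII characters take one
# byte each, and '称' (U+79F0) takes three bytes.
utf8_bytes = name.encode("utf-8")
print(len(name), len(utf8_bytes))

# Decoding the bytes gives back the original string unchanged.
round_trip = utf8_bytes.decode("utf-8")

# US-ASCII has no code point for '称', so encoding fails.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

The same value that round-trips cleanly through UTF-8 cannot be represented in US-ASCII at all, which is why the column type matters here.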
Source: https://stackoverflow.com/questions/49219277/chinese-language-in-cassandra