Why does the SQLite FTS5 unicode61 tokenizer not support CJK (Chinese, Japanese, Korean)?

Submitted by 喜夏-厌秋 on 2020-01-04 01:20:33

Question


I had thought the unicode61 tokenizer could support CJK (Chinese, Japanese, Korean). I verified that my SQLite build supports FTS5:

sqlite> pragma compile_options;
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5

But to my surprise, it can't find any CJK word at all. Why is that?

sqlite> CREATE VIRTUAL TABLE ft5_test USING fts5(content, tokenize = 'porter unicode61 remove_diacritics 1');
sqlite> INSERT INTO ft5_test values('为什么不支持中文 fts5 does not seem to work for chinese');
sqlite> select * from ft5_test where ft5_test = '中文';
sqlite>
sqlite> select * from ft5_test where ft5_test = 'Chinese';
为什么不支持中文 fts5 does not seem to work for chinese
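A likely explanation is that unicode61 splits only on whitespace and punctuation, so an unbroken run of CJK characters is indexed as one long token rather than as separate words. The sketch below reproduces the session above with Python's sqlite3 module (assuming the Python build ships with FTS5 enabled; the helper name `search` is mine) and adds two extra queries to probe that idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE ft5_test USING fts5("
    "content, tokenize = 'porter unicode61 remove_diacritics 1')"
)
conn.execute(
    "INSERT INTO ft5_test VALUES "
    "('为什么不支持中文 fts5 does not seem to work for chinese')"
)

def search(query):
    """Run an FTS5 MATCH query and return the matching rows."""
    sql = "SELECT content FROM ft5_test WHERE ft5_test MATCH ?"
    return conn.execute(sql, (query,)).fetchall()

# '中文' sits in the middle of the CJK run, so it is not a token of its own.
whole_word = search("中文")
# A prefix query on the start of the run should still hit the long token.
prefix = search("为什么*")
# The entire run, exactly as typed, is the token that was indexed.
full_run = search("为什么不支持中文")
# English words are separated by spaces, so they tokenize as expected.
english = search("Chinese")

print(len(whole_word), len(prefix), len(full_run), len(english))
```

If the one-token explanation holds, only the first query comes back empty, which matches what I saw in the shell.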

------------- update ----------

I spent quite some time building an ICU version of SQLite. I shared my experience here: https://stackoverflow.com/a/52866566/301513

From what I have learned, using an ICU build is the only way to support CJK, and FTS5 does not support the ICU tokenizer.

I leave my question here in case others may have new ideas about the problem.
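One idea that might avoid a custom build, offered only as a sketch: put spaces around each CJK character before indexing, so unicode61 sees every character as its own token, and then search with a phrase query built the same way. Everything named here (`segment`, `docs`, the column names, and the rough character ranges in the regex) is my own assumption and not something from SQLite itself. It gives character-level matching, not real word segmentation.

```python
import re
import sqlite3

# Rough, assumed ranges: CJK ideographs, kana, and Hangul syllables.
CJK_CHAR = re.compile(r"([\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af])")

def segment(text):
    """Surround each CJK character with spaces so it becomes one token."""
    return CJK_CHAR.sub(r" \1 ", text)

conn = sqlite3.connect(":memory:")
# Keep the original text unindexed for display; index the segmented copy.
conn.execute(
    "CREATE VIRTUAL TABLE docs USING fts5("
    "original UNINDEXED, segmented, tokenize = 'unicode61')"
)

original = "为什么不支持中文 fts5 does not seem to work for chinese"
conn.execute("INSERT INTO docs VALUES (?, ?)", (original, segment(original)))

def search(term):
    """Match the term as a phrase of single-character tokens."""
    escaped = segment(term).strip().replace('"', '""')
    phrase = '"' + escaped + '"'
    sql = "SELECT original FROM docs WHERE docs MATCH ?"
    return [row[0] for row in conn.execute(sql, (phrase,))]

print(search("中文"))  # characters adjacent and in order in the text
print(search("文中"))  # same characters, reversed order
```

Because the query is a phrase, the characters have to appear next to each other and in the same order, so a reversed term should not match. The cost is a larger index and no notion of word boundaries.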

Source: https://stackoverflow.com/questions/52422437/why-sqlite-fts5-unicode61-tokenizer-does-not-support-cjkchinese-japanese-korean
