How to stop . being treated as a separator in SQLite FTS4

Submitted by 戏子无情 on 2019-12-11 03:36:24

Question


I want to be able to search for numbers like 2.3 using FTS4 in SQLite, but the . is being treated as a token boundary. Short of writing a full bespoke tokenizer, is there any other way of excluding the . from the list of token boundary characters?

Being able to search for decimal numbers seems like a common use case, but I can't find anything relevant on SO / Google. My best solution at present is to replace all . chars in the text with a known (long) string of letters and substitute accordingly on each search...

Peter


Answer 1:


The tokenizer defines what tokens are, so you would indeed need to write your own.

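You could search for the phrase "2 3", which would find the number 2 immediately followed by the number 3, whatever separator sits between them. That matches 2.3, but also text such as 2-3 or 2 3.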
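As a rough illustration of that phrase-query workaround, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the SQLite library linked into Python was built with FTS4; the table name notes and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No tokenize= option, so FTS4 falls back to the default 'simple' tokenizer,
# which splits "2.3" into the two tokens "2" and "3".
conn.execute("CREATE VIRTUAL TABLE notes USING fts4(body)")
conn.executemany(
    "INSERT INTO notes(body) VALUES (?)",
    [
        ("version 2.3 released",),
        ("pages 2-3 missing",),
        ("3 of 2 failed",),
    ],
)

# A phrase query: token "2" immediately followed by token "3".
phrase_hits = [
    row[0]
    for row in conn.execute(
        "SELECT body FROM notes WHERE notes MATCH ? ORDER BY docid",
        ('"2 3"',),
    )
]
print(phrase_hits)
```

The first two rows match and the third does not, which shows the trade-off: the phrase finds 2.3, but it cannot tell a decimal point apart from a hyphen or a space.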




Answer 2:


Actually you don't need to write your own tokenizer.

The 'simple' tokenizer allows the delimiters to be customized, but it's an undocumented feature.

Create your FTS table as follows, and SQLite will tokenize on ' ' (space) and '#' (hash) characters only:

CREATE VIRTUAL TABLE documents USING fts4(title, content, tokenize=simple '' '# ');
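To check the effect of that statement, the following sketch runs it through Python's built-in sqlite3 module. It assumes an FTS4-enabled SQLite build and relies on the undocumented delimiter argument described above; the inserted rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same statement as above: only ' ' and '#' act as delimiters,
# so "2.3" should be indexed as a single token.
conn.execute(
    "CREATE VIRTUAL TABLE documents "
    "USING fts4(title, content, tokenize=simple '' '# ')"
)
conn.executemany(
    "INSERT INTO documents(title, content) VALUES (?, ?)",
    [
        ("a", "upgrade to 2.3 today"),
        ("b", "chapters 2 and 3"),
    ],
)

def titles_matching(term):
    """Return the titles of rows whose indexed text matches the FTS term."""
    rows = conn.execute(
        "SELECT title FROM documents WHERE documents MATCH ? ORDER BY docid",
        (term,),
    )
    return [row[0] for row in rows]

decimal_hits = titles_matching("2.3")  # expected to find only row "a"
bare_hits = titles_matching("2")       # expected to find only row "b"
print(decimal_hits, bare_hits)
```

One side effect to keep in mind: with only space and hash as delimiters, every other punctuation mark (commas, trailing periods, brackets) stays attached to the neighbouring word, so "today." and "today" would be indexed as different tokens.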

This feature was discussed on the SQLite mailing list in 2012, and the code that supports it is in the simple tokenizer's source in SQLite.

Quote from the mailing list regarding its lack of documentation:

"Likely the reason is that we forgot that this feature even exists. It seems to have existed in the simple tokenizer, unchanged, since the original introduction of FTS1 back in 2006."

Quote regarding whether it's safe to use the feature:

"But it has been in the code for so long now that we dare not change it for fear of breaking long-established programs."

...however it was also mentioned that the feature is not likely to have been tested thoroughly.

I have just emailed the SQLite users mailing list, asking if this feature can be documented.

Update: We found that if characters such as '*' or '-' are configured as delimiters like this, they are also stripped out of the FTS MATCH clause in queries. So while this works fine for '.' etc., it can't be used for characters that are also special characters in FTS MATCH.




Answer 3:


You don't need to write your own tokenizer: some built-in tokenizers let you specify, at table creation time, which characters should or should not be treated as word characters. For example, the unicode61 tokenizer (see the FTS3 docs; you may have to enable support for it at build time) accepts tokenchars and separators parameters (poke around in the source for the details).

In your case, I believe you could use something like this:

CREATE VIRTUAL TABLE whatever USING fts4(
    columns,
    tokenize=unicode61 "tokenchars=."
);
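A quick way to try the unicode61 variant is the sketch below, again using Python's built-in sqlite3 module. It assumes the underlying SQLite build includes FTS4 and the unicode61 tokenizer; the table name readings and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# tokenchars=. tells unicode61 to treat '.' as part of a token.
conn.execute(
    'CREATE VIRTUAL TABLE readings '
    'USING fts4(body, tokenize=unicode61 "tokenchars=.")'
)
conn.executemany(
    "INSERT INTO readings(body) VALUES (?)",
    [
        ("measured 2.3 volts",),
        ("the result was 2.3.",),  # sentence-ending period follows the number
    ],
)

def bodies_matching(term):
    """Return the body text of rows matching the FTS term, in insert order."""
    rows = conn.execute(
        "SELECT body FROM readings WHERE readings MATCH ? ORDER BY docid",
        (term,),
    )
    return [row[0] for row in rows]

exact_hits = bodies_matching("2.3")    # exact token "2.3"
prefix_hits = bodies_matching("2.3*")  # any token starting with "2.3"
print(exact_hits)
print(prefix_hits)
```

The second row is there to expose a caveat: because '.' is now a token character, a number at the end of a sentence is indexed as "2.3." and an exact search for 2.3 misses it. A prefix query such as 2.3* is one way around that, at the cost of also matching longer tokens like 2.31.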


Source: https://stackoverflow.com/questions/16080307/how-to-stop-being-treated-as-a-separator-in-sqlite-fts4
