PostgreSQL, find strings differ by n characters

别等时光非礼了梦想. 提交于 2019-12-10 11:27:11

问题


Suppose I have a table like this

id  data
1   0001
2   1000
3   2010
4   0120
5   0020
6   0002

sql fiddle demo

id is primary key, data is fixed length string where characters could be 0, 1, 2.

Is there a way to build an index so I could quickly find strings which are differ by n characters from given string? like for string 0001 and n = 1 I want to get row 6.

Thanks.


回答1:


There is the levenshtein() function, provided by the additional module fuzzystrmatch. It does exactly what you are asking for:

SELECT *
FROM   a
WHERE  levenshtein(data, '1110') = 1;

SQL Fiddle.

But it is not very fast. Slow with big tables, because it can't use an index.

You might get somewhere with the similarity or distance operators provided by the additional module pg_trgm. Those can use a trigram index as detailed in the linked manual pages. I did not get anywhere, the module is using a different definition of "similarity".

Generally the problem seems to fit in the KNN ("k nearest neighbours") search pattern.

If your case is as simple as the example in the question, you can use LIKE in combination with a trigram GIN index, which should be reasonably fast with big tables:

SELECT *
FROM   a
WHERE  data <> '1110'
AND   (data LIKE '_110' OR
       data LIKE '1_10' OR
       data LIKE '11_0' OR
       data LIKE '111_');

Obviously, this technique quickly becomes unfeasible with longer strings and more than 1 difference.

However, since the string is so short, any query will match a rather big percentage of the base table. Therefore, index support will hardly buy you anything. Most of the time it will be faster for Postgres to scan sequentially.

I tested with 10k and 100k rows with and without a trigram GIN index. Since ~ 19% match the criteria for the given test case, a sequential scan is faster and levenshtein() still wins. For more selective queries matching less than around 5 % of the rows (depends), a query using an index is (much) faster.



来源:https://stackoverflow.com/questions/18352332/postgresql-find-strings-differ-by-n-characters

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!