How to find almost similar records in SQL?

Posted by 魔方 西西 on 2019-12-11 07:25:26

Question


This is the search record:

A = {
    field1: value1,
    field2: value2,
    ...
    fieldN: valueN
}

I have many such records in the database.

Another record (B) almost matches record A if at least N-M of their fields are equal. Here is an example with M=2:

B = {
    field1: OTHER_value1,
    field2: OTHER_value2,
    field3: value3,
    ...
    fieldN: valueN
}

The differing fields can be any of them, not only the first ones.

I could write a very large combinatorial SQL query, but maybe there is a more elegant solution.

P.S.: My database is PostgreSQL.


Answer 1:


A search criterion like this can't make use of any indexes, but it can be done:

SELECT
  *
FROM
  yourTable
WHERE
  N-M <= CASE WHEN yourTable.field1 = searchValue1 THEN 1 ELSE 0 END
       + CASE WHEN yourTable.field2 = searchValue2 THEN 1 ELSE 0 END
       + CASE WHEN yourTable.field3 = searchValue3 THEN 1 ELSE 0 END
       ...
       + CASE WHEN yourTable.fieldN = searchValueN THEN 1 ELSE 0 END
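
To make the counting idea concrete, here is a minimal runnable sketch of the query above. It assumes a made-up four-column table and sample rows, and it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL purely so it runs anywhere; the CASE expressions themselves are standard SQL.

```python
import sqlite3

# Assumption: a hypothetical 4-field table. SQLite stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable (id INTEGER PRIMARY KEY,"
    " field1 TEXT, field2 TEXT, field3 TEXT, field4 TEXT)"
)
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a", "b", "c", "d"),  # 4 fields equal to the search record
        (2, "x", "y", "c", "d"),  # 2 fields equal
        (3, "x", "y", "z", "d"),  # 1 field equal
        (4, "a", "b", "c", "w"),  # 3 fields equal
    ],
)

search = ("a", "b", "c", "d")  # the search record A
N, M = 4, 2                    # N fields, at most M may differ

# Each CASE contributes 1 for an equal field; keep rows with at least N-M hits.
query = """
SELECT id
FROM yourTable
WHERE ? <= CASE WHEN field1 = ? THEN 1 ELSE 0 END
         + CASE WHEN field2 = ? THEN 1 ELSE 0 END
         + CASE WHEN field3 = ? THEN 1 ELSE 0 END
         + CASE WHEN field4 = ? THEN 1 ELSE 0 END
ORDER BY id
"""
almost_matching_ids = [row[0] for row in conn.execute(query, (N - M, *search))]
print(almost_matching_ids)  # [1, 2, 4]
```

Row 3 is dropped because only one of its fields matches, which is below N-M = 2.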

Similarly, if your search criteria are in another table:

SELECT
  *
FROM
  yourTable
INNER JOIN
  search
    ON N-M <= CASE WHEN yourTable.field1 = search.field1 THEN 1 ELSE 0 END
            + CASE WHEN yourTable.field2 = search.field2 THEN 1 ELSE 0 END
            + CASE WHEN yourTable.field3 = search.field3 THEN 1 ELSE 0 END
            ...
            + CASE WHEN yourTable.fieldN = search.fieldN THEN 1 ELSE 0 END

(You need to fill in the value of N-M yourself.)

EDIT:

A more long-winded approach that can make some use of indexes:

SELECT
    id,  -- your table would need to have a primary key / identity column
    MAX(field1)   AS field1,
    MAX(field2)   AS field2,
    MAX(field3)   AS field3,
    ...
    MAX(fieldN)   AS fieldN
FROM
(
    SELECT * FROM yourTable WHERE field1 = searchValue1
    UNION ALL
    SELECT * FROM yourTable WHERE field2 = searchValue2
    UNION ALL
    SELECT * FROM yourTable WHERE field3 = searchValue3
    UNION ALL
    ...
    UNION ALL
    SELECT * FROM yourTable WHERE fieldN = searchValueN
)
    AS unioned_seeks
GROUP BY
    id
HAVING
    COUNT(*) >= N-M

If you have an index on each field individually and expect a relatively low number of matches per field, this might outperform the first option, at the expense of very repetitive code.
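
As a rough illustration of the UNION ALL variant, the sketch below builds the repetitive part of the query from a list of field names. The table, sample rows, index names, and the use of Python's sqlite3 module (instead of PostgreSQL) are all assumptions made so the example is runnable; only the query shape comes from the answer above.

```python
import sqlite3

# Assumption: hypothetical table and data. SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable (id INTEGER PRIMARY KEY,"
    " field1 TEXT, field2 TEXT, field3 TEXT, field4 TEXT)"
)
fields = ["field1", "field2", "field3", "field4"]
for f in fields:
    # One single-column index per field, so each seek below can use one.
    conn.execute(f"CREATE INDEX idx_{f} ON yourTable ({f})")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a", "b", "c", "d"),
        (2, "x", "y", "c", "d"),
        (3, "x", "y", "z", "d"),
        (4, "a", "b", "c", "w"),
    ],
)

search = ("a", "b", "c", "d")
N, M = 4, 2

# Generate one seek per field and glue them together with UNION ALL.
seeks = "\n    UNION ALL\n".join(
    f"    SELECT id FROM yourTable WHERE {f} = ?" for f in fields
)
query = f"""
SELECT id, COUNT(*) AS matching_fields
FROM (
{seeks}
) AS unioned_seeks
GROUP BY id
HAVING COUNT(*) >= ?
ORDER BY id
"""
# A row appears once per matching field, so COUNT(*) is its number of matches.
matches_by_id = conn.execute(query, (*search, N - M)).fetchall()
print(matches_by_id)  # [(1, 4), (2, 2), (4, 3)]
```

One thing to keep in mind with this shape: a row that matches on no field at all never enters the union, so it cannot be returned even when N-M is 0.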




Answer 2:


I would do this using "is not distinct from", which treats two NULL values as equal.

You can also use Postgres shorthand to simplify the logic: casting a boolean to int gives 1 for true and 0 for false. One way is:

where ( (a.field1 is not distinct from b.field1)::int +
        (a.field2 is not distinct from b.field2)::int +
        . . .
        (a.fieldn is not distinct from b.fieldn)::int
      ) >= N - M

I think this is easier to express only in terms of M. So, only look at the fields that are different:

where ( (a.field1 is distinct from b.field1)::int +
        (a.field2 is distinct from b.field2)::int +
        . . .
        (a.fieldn is distinct from b.fieldn)::int
      ) <= M

Doing this with your data requires a cross join, which is quite expensive.
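
To show why the NULL handling matters, here is a small sketch that compares every pair of records in a made-up table. It runs on Python's sqlite3 module rather than PostgreSQL, so two substitutions are assumed: SQLite's IS NOT operator plays the role of "is distinct from", and no ::int cast is needed because SQLite comparisons already evaluate to 0 or 1.

```python
import sqlite3

# Assumption: hypothetical 3-field table with NULLs. SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY,"
    " field1 TEXT, field2 TEXT, field3 TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        (1, "a", None, "c"),
        (2, "a", None, "x"),  # differs from record 1 only in field3
        (3, "q", "b", "c"),
    ],
)
N, M = 3, 1

# NULL-safe version: count differing fields, treating NULL vs NULL as equal.
# SQLite's "IS NOT" is used where PostgreSQL would use "is distinct from".
null_safe_query = """
SELECT a.id, b.id
FROM records AS a
JOIN records AS b ON a.id < b.id
WHERE (a.field1 IS NOT b.field1)
    + (a.field2 IS NOT b.field2)
    + (a.field3 IS NOT b.field3) <= ?
ORDER BY a.id, b.id
"""
null_safe_pairs = conn.execute(null_safe_query, (M,)).fetchall()

# Plain "=" version: NULL = NULL is not true, so the shared NULL counts as a miss.
plain_query = """
SELECT a.id, b.id
FROM records AS a
JOIN records AS b ON a.id < b.id
WHERE CASE WHEN a.field1 = b.field1 THEN 1 ELSE 0 END
    + CASE WHEN a.field2 = b.field2 THEN 1 ELSE 0 END
    + CASE WHEN a.field3 = b.field3 THEN 1 ELSE 0 END >= ?
ORDER BY a.id, b.id
"""
plain_pairs = conn.execute(plain_query, (N - M,)).fetchall()

print(null_safe_pairs)  # [(1, 2)]
print(plain_pairs)      # []
```

The join condition a.id < b.id visits each unordered pair once, but the number of pairs still grows quadratically with the table size, which is the cost mentioned above.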



Source: https://stackoverflow.com/questions/47877071/how-to-find-almost-similar-records-in-sql
