How to find almost similar records in SQL?

Question

This is the search record:

A = {
    field1: value1,
    field2: value2,
    ...
    fieldN: valueN
}

I have many such records in the database.

Another record (B) almost matches record A if at least N-M fields in the two records are equal. Here is an example with M=2:

B = {
    field1: OTHER_value1,
    field2: OTHER_value2,
    field3: value3,
    ...
    fieldN: valueN
}

The differing fields can be any of the fields, not only the first ones.

I could write a very large combinatorial SQL query, but maybe there is a more elegant solution.

P.S.: My database is PostgreSQL.

Answer 1:

Such a search criterion won't be able to make use of any indexes, but it can be done...

SELECT
  *
FROM
  yourTable
WHERE
  N-M <= CASE WHEN yourTable.field1 = searchValue1 THEN 1 ELSE 0 END
       + CASE WHEN yourTable.field2 = searchValue2 THEN 1 ELSE 0 END
       + CASE WHEN yourTable.field3 = searchValue3 THEN 1 ELSE 0 END
       ...
       + CASE WHEN yourTable.fieldN = searchValueN THEN 1 ELSE 0 END

Similarly, if your search criteria are in another table...

SELECT
  *
FROM
  yourTable
INNER JOIN
  search
    ON N-M <= CASE WHEN yourTable.field1 = search.field1 THEN 1 ELSE 0 END
            + CASE WHEN yourTable.field2 = search.field2 THEN 1 ELSE 0 END
            + CASE WHEN yourTable.field3 = search.field3 THEN 1 ELSE 0 END
            ...
            + CASE WHEN yourTable.fieldN = search.fieldN THEN 1 ELSE 0 END

(You need to populate the value of N-M yourself.)
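
To make the counting concrete, here is a minimal sketch of the first query on made-up data. The table name records, its four text columns and the sample rows are assumptions for illustration only. With N = 4 and M = 1, the threshold N-M is filled in as 3:

```sql
-- Hypothetical table with N = 4 comparable fields (names are illustrative).
CREATE TEMP TABLE records (
    id     integer PRIMARY KEY,
    field1 text,
    field2 text,
    field3 text,
    field4 text
);

INSERT INTO records VALUES
    (1, 'a', 'b', 'c', 'd'),   -- 4 fields match the search record
    (2, 'a', 'b', 'c', 'x'),   -- 3 fields match
    (3, 'a', 'x', 'y', 'z'),   -- 1 field matches
    (4, 'x', 'b', 'c', 'd');   -- 3 fields match

-- Search record is ('a', 'b', 'c', 'd'); N = 4, M = 1, so N - M = 3.
SELECT *
FROM records
WHERE 3 <= CASE WHEN field1 = 'a' THEN 1 ELSE 0 END
         + CASE WHEN field2 = 'b' THEN 1 ELSE 0 END
         + CASE WHEN field3 = 'c' THEN 1 ELSE 0 END
         + CASE WHEN field4 = 'd' THEN 1 ELSE 0 END
ORDER BY id;
-- returns the rows with id 1, 2 and 4
```

Note that with plain = a NULL on either side never counts as a match, because the comparison yields NULL and the CASE falls through to ELSE 0.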

EDIT:

A more long-winded approach, which can make some use of indexes...

SELECT
    id,  -- your table would need to have a primary key / identity column
    MAX(field1)   AS field1,
    MAX(field2)   AS field2,
    MAX(field3)   AS field3,
    ...
    MAX(fieldN)   AS fieldN
FROM
(
    SELECT * FROM yourTable WHERE field1 = searchValue1
    UNION ALL
    SELECT * FROM yourTable WHERE field2 = searchValue2
    UNION ALL
    SELECT * FROM yourTable WHERE field3 = searchValue3
    UNION ALL
    ...
    UNION ALL
    SELECT * FROM yourTable WHERE fieldN = searchValueN
)
    AS unioned_seeks
GROUP BY
    id
HAVING
    COUNT(*) >= N-M

Where you have an index on each field individually, and where you expect a relatively low number of matches for each field, this might outperform the first option, at the expense of very repetitive code.
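
As a rough sketch of that idea on a hypothetical four-column table (records_idx, its columns, the indexes and the sample rows are all made up for illustration): the branches below select only id, and the outer query joins back to the table to fetch the full rows. That is a variation on the query above which avoids repeating MAX() for every column, not the same statement. N-M is again 3:

```sql
-- Hypothetical table; one single-column index per searchable field.
CREATE TEMP TABLE records_idx (
    id     integer PRIMARY KEY,
    field1 text,
    field2 text,
    field3 text,
    field4 text
);
CREATE INDEX ON records_idx (field1);
CREATE INDEX ON records_idx (field2);
CREATE INDEX ON records_idx (field3);
CREATE INDEX ON records_idx (field4);

INSERT INTO records_idx VALUES
    (1, 'a', 'b', 'c', 'd'),
    (2, 'a', 'b', 'c', 'x'),
    (3, 'a', 'x', 'y', 'z'),
    (4, 'x', 'b', 'c', 'd');

-- Each branch is a simple equality lookup that an index can serve.
-- A row appears once per matching field, so COUNT(*) is its match count.
SELECT r.*
FROM records_idx AS r
JOIN (
    SELECT id
    FROM (
        SELECT id FROM records_idx WHERE field1 = 'a'
        UNION ALL
        SELECT id FROM records_idx WHERE field2 = 'b'
        UNION ALL
        SELECT id FROM records_idx WHERE field3 = 'c'
        UNION ALL
        SELECT id FROM records_idx WHERE field4 = 'd'
    ) AS unioned_seeks
    GROUP BY id
    HAVING COUNT(*) >= 3          -- N - M with N = 4, M = 1
) AS hits USING (id)
ORDER BY r.id;
-- returns the rows with id 1, 2 and 4
```

Whether the planner actually uses the indexes depends on table size and statistics; on a tiny table like this one it will most likely scan sequentially, so check with EXPLAIN on real data.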

Answer 2:

I would do this using IS NOT DISTINCT FROM to handle NULL values.

You can also use Postgres shorthand (casting a boolean to an integer) to simplify the logic. One way is:

where ( (a.field1 is not distinct from b.field1)::int +
        (a.field2 is not distinct from b.field2)::int +
        . . .
        (a.fieldn is not distinct from b.fieldn)::int
      ) >= N - M

I think this is easier to express only in terms of M. So, only look at the fields that are different:

where ( (a.field1 is distinct from b.field1)::int +
        (a.field2 is distinct from b.field2)::int +
        . . .
        (a.fieldn is distinct from b.fieldn)::int
      ) <= M

Doing this with your data requires a cross join, which is quite expensive.
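
For illustration, here is a small self-contained sketch of the M-based version as a self-join. The table items, its three columns and the rows are assumptions, and the join on a.id < b.id is one possible way to list each pair only once. With M = 1, pairs that differ in at most one field are returned, and two NULLs in the same field count as equal:

```sql
-- Hypothetical table with N = 3 comparable fields, some of them NULL.
CREATE TEMP TABLE items (
    id     integer PRIMARY KEY,
    field1 text,
    field2 text,
    field3 text
);

INSERT INTO items VALUES
    (1, 'a', 'b', NULL),
    (2, 'a', 'b', NULL),   -- differs from row 1 in 0 fields
    (3, 'a', 'x', NULL),   -- differs from rows 1 and 2 in 1 field
    (4, 'q', 'r', 's');    -- differs from every other row in 3 fields

-- Every pair of rows is compared, which is the expensive part.
SELECT a.id AS a_id, b.id AS b_id
FROM items AS a
JOIN items AS b ON a.id < b.id
WHERE ( (a.field1 IS DISTINCT FROM b.field1)::int
      + (a.field2 IS DISTINCT FROM b.field2)::int
      + (a.field3 IS DISTINCT FROM b.field3)::int
      ) <= 1                -- M = 1
ORDER BY a_id, b_id;
-- returns the pairs (1, 2), (1, 3) and (2, 3)
```

The number of compared pairs grows quadratically with the row count, so on a large table this is only practical if something else narrows down the candidates first.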

Source: https://stackoverflow.com/questions/47877071/how-to-find-almost-similar-records-in-sql

Tags

sql

algorithm

postgresql

similarity