Question
This is the search record:
A = {
field1: value1,
field2: value2,
...
fieldN: valueN
}
I have many such records in the database.
Another record (B) almost matches record A if at least N-M fields in the two records are equal. Here is an example with M=2:
B = {
field1: OTHER_value1,
field2: OTHER_value2,
field3: value3,
...
fieldN: valueN
}
The differing fields can be any fields, not only the first ones.
I could write a very large combinatorial SQL query, but maybe there is a more elegant solution.
P.S.: My database is PostgreSQL.
Answer 1:
Search criteria like these can't make use of any indexes, but it can be done...
SELECT
  *
FROM
  yourTable
WHERE
  N-M <= CASE WHEN yourTable.field1 = searchValue1 THEN 1 ELSE 0 END
       + CASE WHEN yourTable.field2 = searchValue2 THEN 1 ELSE 0 END
       + CASE WHEN yourTable.field3 = searchValue3 THEN 1 ELSE 0 END
       ...
       + CASE WHEN yourTable.fieldN = searchValueN THEN 1 ELSE 0 END
Similarly, if your search criteria is in another table...
SELECT
  *
FROM
  yourTable
INNER JOIN
  search
    ON N-M <= CASE WHEN yourTable.field1 = search.field1 THEN 1 ELSE 0 END
            + CASE WHEN yourTable.field2 = search.field2 THEN 1 ELSE 0 END
            + CASE WHEN yourTable.field3 = search.field3 THEN 1 ELSE 0 END
            ...
            + CASE WHEN yourTable.fieldN = search.fieldN THEN 1 ELSE 0 END
(You need to populate the value of N-M yourself.)
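Since the CASE sum above is the same term repeated once per field, it can be generated rather than typed out. The following is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, and the four column names, the sample rows and the helper almost_match_query are all made up for illustration. The generated SQL is plain CASE arithmetic, so it should carry over to PostgreSQL, although a PostgreSQL driver such as psycopg uses %s placeholders instead of ?.

```python
import sqlite3

# Hypothetical column names; SQLite stands in for PostgreSQL in this sketch.
FIELDS = ["field1", "field2", "field3", "field4"]


def almost_match_query(table, fields):
    """Build the CASE-sum query: first placeholder is N-M, then one per field.

    Table and column names are interpolated directly, so they must come
    from a trusted list, never from user input.
    """
    terms = [f"CASE WHEN {name} = ? THEN 1 ELSE 0 END" for name in fields]
    return f"SELECT * FROM {table} WHERE ? <= " + " + ".join(terms)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable (id INTEGER PRIMARY KEY, field1, field2, field3, field4)"
)
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a", "b", "c", "d"),  # identical to the search record
        (2, "x", "y", "c", "d"),  # differs in two fields
        (3, "x", "y", "z", "d"),  # differs in three fields
    ],
)

search = ("a", "b", "c", "d")
N, M = len(FIELDS), 2
sql = almost_match_query("yourTable", FIELDS)
# Parameter order follows the placeholders: threshold first, then the values.
matched_ids = sorted(row[0] for row in conn.execute(sql, (N - M, *search)))
print(matched_ids)
```

With M=2, rows 1 and 2 qualify and row 3 (only one equal field) does not.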
EDIT:
A more long-winded approach that can make some use of indexes...
SELECT
  id, -- your table would need to have a primary key / identity column
  MAX(field1) AS field1,
  MAX(field2) AS field2,
  MAX(field3) AS field3,
  ...
  MAX(fieldN) AS fieldN
FROM
(
  SELECT * FROM yourTable WHERE field1 = searchValue1
  UNION ALL
  SELECT * FROM yourTable WHERE field2 = searchValue2
  UNION ALL
  SELECT * FROM yourTable WHERE field3 = searchValue3
  ...
  UNION ALL
  SELECT * FROM yourTable WHERE fieldN = searchValueN
)
  AS unioned_seeks
GROUP BY
  id
HAVING
  COUNT(*) >= N-M
Where you have an index on each field individually, and where you expect a relatively low number of matches for each field, this might outperform the first option, at the expense of very repetitious code.
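The repetition in the UNION ALL version can also be pushed into a small generator. This is a rough sketch under the same assumptions as before: sqlite3 stands in for PostgreSQL, and the column names, sample rows and the helper union_seek_query are hypothetical. It also deviates from the query above in one way: each branch selects only id, so the result is the id plus the number of equal fields, and you would join back to yourTable to fetch the full rows.

```python
import sqlite3

# Hypothetical column names; SQLite stands in for PostgreSQL in this sketch.
FIELDS = ["field1", "field2", "field3", "field4"]


def union_seek_query(table, fields):
    """One equality seek per field, then count how many seeks hit each id."""
    seeks = "\n    UNION ALL\n".join(
        f"    SELECT id FROM {table} WHERE {name} = ?" for name in fields
    )
    return (
        "SELECT id, COUNT(*) AS equal_fields\n"
        f"FROM (\n{seeks}\n) AS unioned_seeks\n"
        "GROUP BY id\n"
        "HAVING COUNT(*) >= ?"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable (id INTEGER PRIMARY KEY, field1, field2, field3, field4)"
)
for name in FIELDS:
    # One single-column index per field, so each seek can use its own index.
    conn.execute(f"CREATE INDEX idx_{name} ON yourTable ({name})")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a", "b", "c", "d"),  # identical to the search record
        (2, "x", "y", "c", "d"),  # differs in two fields
        (3, "x", "y", "z", "d"),  # differs in three fields
    ],
)

search = ("a", "b", "c", "d")
N, M = len(FIELDS), 2
sql = union_seek_query("yourTable", FIELDS)
# Parameter order follows the placeholders: the values first, then N-M.
hits = dict(conn.execute(sql, (*search, N - M)).fetchall())
print(hits)
```

Because id is the primary key, a row shows up at most once per branch, so COUNT(*) per id is exactly the number of fields that equal the search values.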
Answer 2:
I would do this using is not distinct from to handle NULL values.
You can also use Postgres short-hand to simplify the logic. One way is:
where ( (a.field1 is not distinct from b.field1)::int +
        (a.field2 is not distinct from b.field2)::int +
        . . .
        (a.fieldn is not distinct from b.fieldn)::int
      ) >= N - M
I think this is easier to express only in terms of M. So, only look at the fields that are different:
where ( (a.field1 is distinct from b.field1)::int +
        (a.field2 is distinct from b.field2)::int +
        . . .
        (a.fieldn is distinct from b.fieldn)::int
      ) <= M
Doing this with your data requires a cross join, which is quite expensive.
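To see why the NULL-safe comparison matters, here is a small sketch that pairs up rows of one table and counts differing fields. It again uses Python's sqlite3 as a stand-in, and there SQLite's IS NOT operator plays the role of PostgreSQL's is distinct from (it already yields 0 or 1, so no ::int cast is needed). The table t, its three columns and the sample rows are invented for the example, and the a.id < b.id condition is my own addition to skip self-pairs and mirrored pairs; the join still compares every pair of rows.

```python
import sqlite3

FIELDS = ["field1", "field2", "field3"]  # hypothetical column names
M = 1  # maximum number of differing fields

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, field1, field2, field3)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "a", None, "c"),
        (2, "a", None, "x"),  # differs from row 1 only in field3; field2 is NULL in both
        (3, "q", "r", "s"),
    ],
)

# NULL-safe: count fields that differ, treating NULL vs NULL as "not different".
null_safe = " + ".join(f"(a.{f} IS NOT b.{f})" for f in FIELDS)
# Plain "=": NULL = NULL evaluates to NULL, so the CASE falls through to 0.
plain = " + ".join(f"CASE WHEN a.{f} = b.{f} THEN 1 ELSE 0 END" for f in FIELDS)

pair_sql = (
    "SELECT a.id, b.id FROM t AS a JOIN t AS b ON a.id < b.id "
    "WHERE {cond} ORDER BY a.id, b.id"
)

null_safe_pairs = conn.execute(
    pair_sql.format(cond=f"({null_safe}) <= ?"), (M,)
).fetchall()
plain_pairs = conn.execute(
    pair_sql.format(cond=f"({plain}) >= ?"), (len(FIELDS) - M,)
).fetchall()

print(null_safe_pairs)
print(plain_pairs)
```

The NULL-safe version reports rows 1 and 2 as an almost-matching pair, while the plain = version finds nothing, because the shared NULL in field2 is not counted as equal.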
Source: https://stackoverflow.com/questions/47877071/how-to-find-almost-similar-records-in-sql