How to select similar sets in SQL

被撕碎了的回忆 2020-12-13 15:38

I have the following tables:

Order
----
ID (pk)

OrderItem
----
OrderID (fk -> Order.ID)
ItemID (fk -> Item.ID)
Quantity

Item
----
ID (pk)

8 answers

夕颜 (original poster)

2020-12-13 16:05

For speed, I would try something like this, which lists orders by their similarity to order @OrderId. The joined alias INTS is supposed to be the intersection with that order's items, and the Similarity value is my attempt to calculate the Jaccard index: the number of items the two orders share divided by the number of distinct items across both.

I am not using the Quantity field at all here, but I think it could be included without slowing the query down too much, if we can find a way to quantify similarity that takes quantity into account. Below, I count any identical item in two orders as a match. You could join on quantity as well, or use a measure where a match that also agrees on quantity counts double. I don't know if that is reasonable.

SELECT 
    OI.OrderId,
    1.0*COUNT(INTS.ItemId) / 
    (COUNT(*)
    + (SELECT COUNT(*) FROM OrderItem WHERE OrderID = @OrderId) 
    - COUNT(INTS.ItemId)) AS Similarity
FROM    
    OrderItem OI        
-- LEFT JOIN keeps OI rows without a match, so COUNT(*) is the order's full item count
LEFT JOIN
    OrderItem INTS ON INTS.ItemID = OI.ItemID AND INTS.OrderId=@OrderId
GROUP BY 
    OI.OrderId
HAVING  
    1.0*COUNT(INTS.ItemId) / 
    (COUNT(*)
    + (SELECT COUNT(*) FROM OrderItem WHERE OrderID = @OrderId) 
    - COUNT(INTS.ItemId)) > 0.85
ORDER BY
    Similarity DESC

This also presupposes that each OrderId/ItemId combination is unique in OrderItem. I realize that might not be the case, but it could be worked around using a view.
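To sanity-check the Jaccard arithmetic, here is a minimal sketch that runs the same kind of query on a few made-up rows. It is only an illustration under some assumptions: it uses Python's built-in sqlite3 module instead of SQL Server, a named parameter :order_id in place of @OrderId, a LEFT JOIN so that COUNT(*) is each order's full item count, and no HAVING filter so that every order shows up. The sample data and the helper name jaccard_by_order are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OrderItem (OrderID INTEGER, ItemID INTEGER, Quantity INTEGER)"
)
# Hypothetical sample data; every (OrderID, ItemID) pair is unique.
conn.executemany(
    "INSERT INTO OrderItem VALUES (?, ?, ?)",
    [
        (1, 10, 1), (1, 11, 2), (1, 12, 1),  # target order: items 10, 11, 12
        (2, 10, 1), (2, 11, 5), (2, 12, 1),  # same item set as order 1
        (3, 10, 1), (3, 11, 1), (3, 13, 1),  # two shared items, one different
        (4, 20, 1),                          # nothing in common
    ],
)

JACCARD_SQL = """
SELECT OI.OrderID,
       1.0 * COUNT(INTS.ItemID)
       / (COUNT(*)
          + (SELECT COUNT(*) FROM OrderItem WHERE OrderID = :order_id)
          - COUNT(INTS.ItemID)) AS Similarity
FROM OrderItem OI
LEFT JOIN OrderItem INTS
       ON INTS.ItemID = OI.ItemID AND INTS.OrderID = :order_id
GROUP BY OI.OrderID
ORDER BY Similarity DESC
"""


def jaccard_by_order(connection, order_id):
    """Return {OrderID: Jaccard similarity to order_id}, ignoring quantities."""
    rows = connection.execute(JACCARD_SQL, {"order_id": order_id}).fetchall()
    return dict(rows)


similarities = jaccard_by_order(conn, 1)
for other_id, value in similarities.items():
    print(other_id, round(value, 3))
```

Order 3 shares 2 items with order 1 and the two orders contain 4 distinct items between them, so its similarity should come out as 2 / 4. The target order itself is also listed, with similarity 1.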

I'm sure there are better ways, but one way to weigh in quantity differences would be to replace the numerator COUNT(INTS.ItemId) with something like this (assuming all quantities are positive), which decreases each hit slowly towards 0 as the quantities diverge.

    SUM(1.0/(ABS(LOG(OI.Quantity)-LOG(INTS.Quantity))+1))
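To get a feel for how that per-item weight behaves, here is a small sketch of the same expression in Python. The function name quantity_weight is hypothetical; math.log is the natural logarithm, which is also what LOG with a single argument computes in T-SQL.

```python
import math


def quantity_weight(q1, q2):
    """Per-item weight: 1.0 for equal quantities, approaching 0 as the ratio grows."""
    if q1 <= 0 or q2 <= 0:
        # The logarithm is only defined for positive quantities.
        raise ValueError("quantities must be positive")
    return 1.0 / (abs(math.log(q1) - math.log(q2)) + 1.0)


# Equal quantities, a factor of 2 apart, and a factor of 100 apart.
for q1, q2 in [(5, 5), (2, 4), (1, 100)]:
    print(q1, q2, round(quantity_weight(q1, q2), 3))
```

The weight depends only on the ratio of the two quantities, so 2 vs 4 is penalized exactly like 50 vs 100, and swapping the arguments gives the same value.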

Added: here is a more readable solution using the Tanimoto similarity suggested by JRideout:

DECLARE 
    @ItemCount INT,
    @OrderId int 
SELECT     
    @OrderId  = 1
SELECT     
    @ItemCount = COUNT(*)
FROM 
    OrderItem
WHERE 
    OrderID = @OrderId 


SELECT 
    OI.OrderId,
    SUM(1.0* OI.Quantity*INTS.Quantity/(OI.Quantity*OI.Quantity+INTS.Quantity*INTS.Quantity-OI.Quantity*INTS.Quantity )) /
    (COUNT(*) + @ItemCount - COUNT(INTS.ItemId)) AS Similarity
FROM    
    OrderItem OI        
LEFT JOIN
    OrderItem INTS ON INTS.ItemID = OI.ItemID AND INTS.OrderId=@OrderId
GROUP BY 
    OI.OrderId
HAVING      
    SUM(1.0* OI.Quantity*INTS.Quantity/(OI.Quantity*OI.Quantity+INTS.Quantity*INTS.Quantity-OI.Quantity*INTS.Quantity )) /
    (COUNT(*) + @ItemCount - COUNT(INTS.ItemId)) > 0.85
ORDER BY
    Similarity DESC
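As a rough check of the quantity-weighted version, this sketch runs an equivalent query on made-up rows. The same assumptions apply as for any quick test of this kind: Python's sqlite3 instead of SQL Server, named parameters :order_id and :item_count in place of the T-SQL variables, invented sample data, and no HAVING filter so every order is shown. The helper name weighted_similarity is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OrderItem (OrderID INTEGER, ItemID INTEGER, Quantity INTEGER)"
)
# Hypothetical sample data; every (OrderID, ItemID) pair is unique.
conn.executemany(
    "INSERT INTO OrderItem VALUES (?, ?, ?)",
    [
        (1, 10, 2), (1, 11, 4),  # target order
        (2, 10, 2), (2, 11, 4),  # identical items and quantities
        (3, 10, 2), (3, 11, 2),  # same items, one quantity differs
        (4, 10, 2), (4, 12, 1),  # only one shared item
    ],
)

WEIGHTED_SQL = """
SELECT OI.OrderID,
       SUM(1.0 * OI.Quantity * INTS.Quantity
           / (OI.Quantity * OI.Quantity
              + INTS.Quantity * INTS.Quantity
              - OI.Quantity * INTS.Quantity))
       / (COUNT(*) + :item_count - COUNT(INTS.ItemID)) AS Similarity
FROM OrderItem OI
LEFT JOIN OrderItem INTS
       ON INTS.ItemID = OI.ItemID AND INTS.OrderID = :order_id
GROUP BY OI.OrderID
ORDER BY Similarity DESC
"""


def weighted_similarity(connection, order_id):
    """Return {OrderID: quantity-weighted similarity to order_id}."""
    item_count = connection.execute(
        "SELECT COUNT(*) FROM OrderItem WHERE OrderID = ?", (order_id,)
    ).fetchone()[0]
    rows = connection.execute(
        WEIGHTED_SQL, {"order_id": order_id, "item_count": item_count}
    ).fetchall()
    return dict(rows)


weighted = weighted_similarity(conn, 1)
for other_id, value in weighted.items():
    print(other_id, round(value, 3))
```

For order 3, item 10 contributes 1 and item 11 contributes 2*4 / (4 + 16 - 8) = 2/3, so the sum 5/3 divided by the 2 distinct items gives 5/6. Items without a match produce NULL terms, which SUM skips, so an order with no shared items at all ends up with a NULL similarity rather than 0.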
