How to select similar sets in SQL

被撕碎了的回忆 2020-12-13 15:38

I have the following tables:

Order
----
ID (pk)

OrderItem
----
OrderID (fk -> Order.ID)
ItemID (fk -> Item.ID)
Quantity

Item
----
ID (pk)
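
For concreteness, the tables above could be declared roughly as follows. This is only a sketch in T-SQL: the column types, the positive-quantity check, and the sample rows are assumptions for illustration and are not given in the question. Order is a reserved word, so it is bracketed.

CREATE TABLE Item (
    ID INT PRIMARY KEY
);

CREATE TABLE [Order] (
    ID INT PRIMARY KEY
);

CREATE TABLE OrderItem (
    OrderID  INT NOT NULL REFERENCES [Order](ID),
    ItemID   INT NOT NULL REFERENCES Item(ID),
    Quantity INT NOT NULL CHECK (Quantity > 0)  -- assumption: quantities are positive
);

-- Hypothetical sample data: orders 1 and 2 share three items, order 3 shares one with order 1
INSERT INTO Item (ID) VALUES (1), (2), (3), (4);
INSERT INTO [Order] (ID) VALUES (1), (2), (3);
INSERT INTO OrderItem (OrderID, ItemID, Quantity) VALUES
    (1, 1, 2), (1, 2, 1), (1, 3, 5),
    (2, 1, 2), (2, 2, 1), (2, 3, 4),
    (3, 1, 1), (3, 4, 7);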
         


        
8 answers
  •  夕颜 (original poster)
    2020-12-13 16:05

    I would try something like this for speed, listing orders by similarity to Order @OrderId. The joined INTS is supposed to be the intersection and the similarity value is my attempt to calculate the Jaccard index.

    I am not using the quantity field at all here, but I think it could be included without slowing the query down too much, provided we find a way to quantify similarity that takes quantity into account. Below, I count any identical item in two orders as a match. You could join on quantity as well, or use a measure where a match that includes quantity counts double. I don't know if that is reasonable.

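    SELECT 
        OI.OrderId,
        -- Jaccard index: shared items / (items in this order + items in @OrderId - shared items)
        1.0*COUNT(INTS.ItemId) / 
        (COUNT(*)
        + (SELECT COUNT(*) FROM OrderItem WHERE OrderID = @OrderId) 
        - COUNT(INTS.ItemId)) AS Similarity
    FROM    
        OrderItem OI        
    -- LEFT JOIN keeps unmatched items, so COUNT(*) is the full size of the candidate order
    -- while COUNT(INTS.ItemId) counts only the items shared with @OrderId
    LEFT JOIN
        OrderItem INTS ON INTS.ItemID = OI.ItemID AND INTS.OrderId=@OrderId
    GROUP BY 
        OI.OrderId
    HAVING  
        1.0*COUNT(INTS.ItemId) / 
        (COUNT(*)
        + (SELECT COUNT(*) FROM OrderItem WHERE OrderID = @OrderId) 
        - COUNT(INTS.ItemId)) > 0.85
    ORDER BY
        Similarity DESC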
    

    It also presupposes that OrderId/ItemId combinations are unique in OrderItem. I realize this might not be the case, and it could be worked around using a view.
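
    A minimal sketch of such a view, assuming that duplicate OrderId/ItemId rows should be merged by summing their quantities (that merge rule and the view name OrderItemUnique are assumptions, not part of the schema):

    CREATE VIEW OrderItemUnique AS
    SELECT
        OrderID,
        ItemID,
        SUM(Quantity) AS Quantity   -- collapse duplicate rows into one per order and item
    FROM
        OrderItem
    GROUP BY
        OrderID,
        ItemID

    The similarity queries would then read from OrderItemUnique wherever they currently read from OrderItem.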

    I'm sure there are better ways, but one way to take quantity differences into account would be to replace the numerator COUNT(INTS.ItemId) with a sum like the one below (assuming all quantities are positive). Each matching item then contributes 1 when the quantities are equal, and its contribution decreases slowly towards 0 as the quantities diverge.

        SUM(1.0/(ABS(LOG(OI.Quantity)-LOG(INTS.Quantity))+1))
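
    Put into a full query, that idea might look roughly like the sketch below. It assumes a LEFT JOIN, so that unmatched items still count towards the denominator, and a precomputed @ItemCount variable. The COALESCE is an added assumption so that orders sharing no items score 0 instead of NULL.

    DECLARE 
        @ItemCount INT,
        @OrderId INT
    SELECT     
        @OrderId = 1
    SELECT     
        @ItemCount = COUNT(*)
    FROM 
        OrderItem
    WHERE 
        OrderID = @OrderId

    SELECT 
        OI.OrderId,
        -- each shared item adds a weight in (0, 1]; unmatched rows yield NULL and are skipped by SUM
        COALESCE(SUM(1.0/(ABS(LOG(OI.Quantity)-LOG(INTS.Quantity))+1)), 0) /
        (COUNT(*) + @ItemCount - COUNT(INTS.ItemId)) AS Similarity
    FROM    
        OrderItem OI        
    LEFT JOIN
        OrderItem INTS ON INTS.ItemID = OI.ItemID AND INTS.OrderId=@OrderId
    GROUP BY 
        OI.OrderId
    ORDER BY
        Similarity DESC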
    

    Added: a more readable solution using the Tanimoto similarity suggested by JRideout:

    DECLARE 
        @ItemCount INT,
        @OrderId int 
    SELECT     
        @OrderId  = 1
    SELECT     
        @ItemCount = COUNT(*)
    FROM 
        OrderItem
    WHERE 
        OrderID = @OrderId 
    
    
    SELECT 
        OI.OrderId,
        SUM(1.0* OI.Quantity*INTS.Quantity/(OI.Quantity*OI.Quantity+INTS.Quantity*INTS.Quantity-OI.Quantity*INTS.Quantity )) /
        (COUNT(*) + @ItemCount - COUNT(INTS.ItemId)) AS Similarity
    FROM    
        OrderItem OI        
    LEFT JOIN
        OrderItem INTS ON INTS.ItemID = OI.ItemID AND INTS.OrderId=@OrderId
    GROUP BY 
        OI.OrderId
    HAVING      
        SUM(1.0* OI.Quantity*INTS.Quantity/(OI.Quantity*OI.Quantity+INTS.Quantity*INTS.Quantity-OI.Quantity*INTS.Quantity )) /
        (COUNT(*) + @ItemCount - COUNT(INTS.ItemId)) > 0.85
    ORDER BY
        Similarity DESC
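
    To get a feel for the per-item term in the SUM above, here is a small sketch that evaluates it for a few made-up quantity pairs (the pairs and the aliases Q, A and B are purely illustrative):

    SELECT
        Q.A,
        Q.B,
        -- same expression as the per-item term, with A and B standing in for the two quantities
        1.0 * Q.A * Q.B / (Q.A * Q.A + Q.B * Q.B - Q.A * Q.B) AS Term
    FROM
        (VALUES (2, 2), (1, 2), (1, 10)) AS Q(A, B)

    Equal quantities (2, 2) give 4/4 = 1, the pair (1, 2) gives 2/3, or about 0.67, and (1, 10) gives 10/91, or about 0.11. So an item ordered in very different quantities contributes little to the similarity.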
    
