I have the following tables:
Order
----
ID (pk)

OrderItem
----
OrderID (fk -> Order.ID)
ItemID (fk -> Item.ID)
Quantity

Item
----
ID (pk)
I would try something like this for speed, listing orders by their similarity to order @OrderId. The joined INTS is supposed to be the intersection, and the similarity value is my attempt to calculate the Jaccard index.
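For reference, the Jaccard index of two item sets A (a candidate order) and B (order @OrderId) is

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|)

In the query below, COUNT(INTS.ItemId) plays the role of |A ∩ B|, COUNT(*) is |A|, and the subquery is |B|.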
I am not using the Quantity field at all here, but I think it could be included too without slowing the query down much, if we figure out a way to quantify similarity that takes quantity into account. Below, I am counting any item that appears in both orders as a match. You could join on Quantity as well, or use a measure where a match that includes quantity counts double; I don't know whether that is reasonable.
SELECT
    OI.OrderId,
    -- Jaccard: |A ∩ B| / (|A| + |B| - |A ∩ B|)
    1.0 * COUNT(INTS.ItemId) /                                     -- |A ∩ B|
    (COUNT(*)                                                      -- |A|
     + (SELECT COUNT(*) FROM OrderItem WHERE OrderID = @OrderId)   -- |B|
     - COUNT(INTS.ItemId)) AS Similarity
FROM
    OrderItem OI
LEFT JOIN
    -- LEFT JOIN, so COUNT(*) counts every item of OI.OrderId, not just the matches
    OrderItem INTS ON INTS.ItemID = OI.ItemID AND INTS.OrderId = @OrderId
GROUP BY
    OI.OrderId
HAVING
    1.0 * COUNT(INTS.ItemId) /
    (COUNT(*)
     + (SELECT COUNT(*) FROM OrderItem WHERE OrderID = @OrderId)
     - COUNT(INTS.ItemId)) > 0.85
ORDER BY
    Similarity DESC
It also presupposes that OrderId/ItemId combinations are unique in OrderItem. I realize this might not be the case, and it could be worked around using a view.
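For instance, a view along these lines (the name OrderItemUnique is mine) would collapse duplicates, assuming that summing their quantities is the right way to combine them:

CREATE VIEW OrderItemUnique AS
SELECT
    OrderID,
    ItemID,
    SUM(Quantity) AS Quantity  -- merge duplicate (OrderID, ItemID) rows into one
FROM OrderItem
GROUP BY OrderID, ItemID

Querying the view in place of OrderItem would keep the intersection counts honest.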
I'm sure there are better ways, but one way to weigh in the quantity difference would be to replace the numerator COUNT(INTS.ItemId) with a sum like this (supposing all quantities are positive) that decreases each hit slowly towards 0 as the quantities diverge.
SUM(1.0/(ABS(LOG(OI.Quantity)-LOG(INTS.Quantity))+1))
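With equal quantities each term is still 1, and it falls off gently as they diverge: for quantities 2 and 4, for instance, a match counts as 1/(|LOG(2) - LOG(4)| + 1) = 1/(LOG(2) + 1) ≈ 0.59 (note that T-SQL's LOG is the natural logarithm by default).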
Added: here is a more readable solution using the Tanimoto similarity suggested by JRideout.
DECLARE
    @ItemCount INT,
    @OrderId INT

SELECT
    @OrderId = 1

-- number of items in the reference order: |B|
SELECT
    @ItemCount = COUNT(*)
FROM
    OrderItem
WHERE
    OrderID = @OrderId

SELECT
    OI.OrderId,
    -- per-item Tanimoto scores summed over the intersection,
    -- divided by the size of the union: COUNT(*) + @ItemCount - COUNT(INTS.ItemId)
    SUM(1.0 * OI.Quantity * INTS.Quantity / (OI.Quantity * OI.Quantity + INTS.Quantity * INTS.Quantity - OI.Quantity * INTS.Quantity)) /
    (COUNT(*) + @ItemCount - COUNT(INTS.ItemId)) AS Similarity
FROM
    OrderItem OI
LEFT JOIN
    OrderItem INTS ON INTS.ItemID = OI.ItemID AND INTS.OrderId = @OrderId
GROUP BY
    OI.OrderId
HAVING
    SUM(1.0 * OI.Quantity * INTS.Quantity / (OI.Quantity * OI.Quantity + INTS.Quantity * INTS.Quantity - OI.Quantity * INTS.Quantity)) /
    (COUNT(*) + @ItemCount - COUNT(INTS.ItemId)) > 0.85
ORDER BY
    Similarity DESC
There's really no easy answer to this. You can certainly store the Jaccard index (actually, I'd just store the pairs that meet the criteria and throw out the rest), but the real problem is calculating it: you effectively have to scan all of your existing orders each time a new order is entered into the system to compute the new index.
That can be quite expensive, depending on the volume of orders you're maintaining. Maybe you only compare against the last year of orders, or something.
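If you go that route, restricting the comparison set is straightforward, assuming the Order table carries an order date (the schema above doesn't show one, so OrderDate here is hypothetical):

SELECT OI.OrderID, OI.ItemID, OI.Quantity
FROM OrderItem OI
JOIN [Order] O ON O.ID = OI.OrderID
WHERE O.OrderDate >= DATEADD(year, -1, GETDATE())  -- OrderDate is assumed; adjust to your schema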
If you're doing it on the fly, it gets more interesting, but still expensive.
You can readily get a list of all orders that contain a given item -- one list per item. This, in fact, is not necessarily a lot of data (though if you have a lot of orders for a single popular item, the list can get long). The individual queries aren't particularly insane either (again, depending on your data). If you have a vast amount of data, the query can readily be map/reduced and will even work with sharded data stores. Bitmap indexes (if your DB supports them) are particularly good for getting lists like this quickly.
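Each of those per-item lists is a trivial query on its own; for example (with @ItemId standing in for whichever item you're looking up):

SELECT orderId FROM OrderItem WHERE itemId = @ItemId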
Then you can simply count the number of times an order number occurs across all of the lists, and drop the ones that don't meet the threshold. That's a straightforward merge operation.
But you'd have to do this calculation every single time you want the information, since you can't really store it.
So, it really does boil down to what you need the information for, how often you need it, your items <-> order distribution, how long you can wait for it, etc.
Addendum:
Thinking about it a little more, this is a simple query, but it may take some time to run. Likely not much with modern hardware, since you don't really have that much data. For a single screen viewing an order, you wouldn't notice it. If you were running a report across all orders, then you would definitely notice it -- and would need a different approach.
Let's consider an order with 20 line items.
And you want an 85% match. That means orders that have 17 or more items in common.
Here is a query that will give you the orders you're interested in:
SELECT orderId, count(*) FROM OrderItem
WHERE itemId in ('list', 'of', 'items', 'in', 'order', 123, 456, 789)
GROUP BY orderId
HAVING count(*) >= 17
So, the WHERE clause pulls every line item that matches one of your order's items, the GROUP BY sums them up by orderId, and those equal to or greater than your threshold (17 in this case) are the candidate orders.
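If the reference order is already in the database, you don't need to write out the item list by hand; the IN list can be a subquery (a sketch, with @OrderId standing in for the order you're comparing against, and the threshold still hard-coded as above):

SELECT orderId, count(*) FROM OrderItem
WHERE itemId IN (SELECT itemId FROM OrderItem WHERE orderId = @OrderId)
AND orderId <> @OrderId  -- leave the reference order itself out of the results
GROUP BY orderId
HAVING count(*) >= 17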
Now, you don't say how many items you have in your catalog. If you have 1000 items, perfectly distributed, this query will chew on 1600 rows of data -- which is no big deal. With proper indexes this should go quite quickly. However, if you have items that are "really popular", then you're going to chew through a lot more rows of data.
But, again, you don't have that much data. Most of this query can be done within the indexes on a proper database without even hitting the actual tables. So, as I said, you'll likely not notice the impact of this query on an interactive system.
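For example, a composite index like this one (the name is mine) covers the query above entirely, letting the database answer it from the index alone:

CREATE INDEX IX_OrderItem_ItemId_OrderId ON OrderItem (itemId, orderId)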
So, give it a try and see how it goes for you.