Question
I'm designing a simple messaging schema where a thread groups all messages that are sent between a collection of users. I'm getting stuck when I have to find an existing thread given a set of users.
There are 2 scenarios for sending a message:
Send To Thread: When viewing a thread, a message is sent directly to that thread, so the threadID is known. (not a problem)
Send To Recipients: A user creates a new message and specifies a set of recipients from scratch. I only want to create a new thread if one doesn't already exist between these users, which is where I'm stuck. I need a query that will find an existing threadID given a set of users. The ThreadMembers table maps users to threads. Is this even possible? Or do I need to modify my tables?
My tables:
Thread:
    threadID (id)
    lastSent (timestamp)
ThreadMembers:
    threadFK (foreign key to thread)
    userFK (foreign key to user)
Messages:
    threadFK (foreign key to thread)
    senderFK (foreign key to user)
    msgID (id)
    msgDate (timestamp)
    msgText (text)
Thank you very much!
Answer 1:
EDIT:
I realized, in the course of attempting to explain the query, that it wouldn't always work correctly. So, I went back and figured out how to test this. I'm still bugged by the schema setup - namely, it implies that new users can't be added to an existing thread, and that a specific set of users will only be able to talk in one thread - but it was good to correct the query.
WITH Selected_Users(id) as (VALUES (@id1), (@id2) /* , etc. */),
     Threads(id) as (SELECT DISTINCT threadFk
                     FROM ThreadMembers as a
                     JOIN Selected_Users as b
                       ON b.id = a.userFk)
SELECT a.id
FROM Threads as a
WHERE NOT EXISTS (SELECT '1'
                  FROM ThreadMembers as b
                  LEFT JOIN Selected_Users as c
                         ON c.id = b.userFk
                  WHERE c.id IS NULL
                    AND b.threadFk = a.id)
  AND NOT EXISTS (SELECT '1'
                  FROM Selected_Users as b
                  LEFT JOIN ThreadMembers as c
                         ON c.userFk = b.id
                        AND c.threadFk = a.id
                  WHERE c.userFk IS NULL)
The statement will likely have to be built dynamically to supply the list of selected users, unless SQL Server has a way to pass a list as a host variable (I know DB2 does, at least on the iSeries). I don't have the perfect dataset to test this against, but against a multi-million-row table (with only a many-to-one relationship) it returns almost instantly; I'm getting index-only access for this (hint, hint).
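One way to build the statement dynamically, as described above, is to generate one bound placeholder per selected user. The following is a minimal sketch that runs the NOT EXISTS query from Python against an in-memory SQLite database. SQLite, the sample rows, the integer user ids, and the helper name `find_threads_not_exists` are all assumptions made for illustration; they are not part of the original answer, and other databases may need a different syntax for the `VALUES` list.

```python
import sqlite3

def find_threads_not_exists(conn, user_ids):
    """Return the threads whose member set is exactly user_ids."""
    ids = sorted(set(user_ids))
    if not ids:
        return []
    # One "(?)" row per selected user; the ids are bound, never interpolated.
    values = ", ".join("(?)" for _ in ids)
    sql = f"""
        WITH Selected_Users(id) AS (VALUES {values}),
             Threads(id) AS (SELECT DISTINCT threadFk
                             FROM ThreadMembers AS a
                             JOIN Selected_Users AS b ON b.id = a.userFk)
        SELECT a.id
        FROM Threads AS a
        WHERE NOT EXISTS (SELECT 1
                          FROM ThreadMembers AS b
                          LEFT JOIN Selected_Users AS c ON c.id = b.userFk
                          WHERE c.id IS NULL
                            AND b.threadFk = a.id)
          AND NOT EXISTS (SELECT 1
                          FROM Selected_Users AS b
                          LEFT JOIN ThreadMembers AS c
                                 ON c.userFk = b.id AND c.threadFk = a.id
                          WHERE c.userFk IS NULL)
        ORDER BY a.id
    """
    return [row[0] for row in conn.execute(sql, ids)]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ThreadMembers (
    threadFk INTEGER NOT NULL,
    userFk   INTEGER NOT NULL,
    PRIMARY KEY (threadFk, userFk)
)""")
# Hypothetical sample data: thread 1 = {10, 20}, 2 = {10, 20, 30}, 3 = {10, 30}
conn.executemany("INSERT INTO ThreadMembers VALUES (?, ?)",
                 [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 10), (3, 30)])

print(find_threads_not_exists(conn, [10, 20]))
```

With these sample rows, only thread 1 has exactly users 10 and 20 as its members, so it is the only thread returned.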
Explanations:
WITH Selected_Users(id) as (VALUES (@id1), (@id2) /* , etc. */),
This CTE builds the list of users so that it can be referenced as a table. That makes it easiest to deal with, although it would be possible to simply replace it with an IN list everywhere (which requires repeating the list at each reference, though).
     Threads(id) as (SELECT DISTINCT threadFk
                     FROM ThreadMembers as a
                     JOIN Selected_Users as b
                       ON b.id = a.userFk)
This CTE gets the list of (distinct) threads that the users are involved in. Mostly, this is just to chop the listing down to single references to threadFk.
SELECT a.id
FROM Threads as a
... Get the selected set of threads ...
WHERE NOT EXISTS (SELECT '1'
                  FROM ThreadMembers as b
                  LEFT JOIN Selected_Users as c
                         ON c.id = b.userFk
                  WHERE c.id IS NULL
                    AND b.threadFk = a.id)
This keeps only threads in which nobody falls outside the selected list of users; that is, it eliminates threads whose member list is a superset of the selection. It also eliminates threads that contain some of the selected users plus a few who aren't selected, where the user counts would match but the actual users would not (this is where my first version failed).
EDIT:
I realized that, while the existing statement takes care of the situation where the provided list of users is a subset of users listed for a given thread, I didn't take care of the situation where the list of selected users contains a subset that is the list of users for the given thread.
  AND NOT EXISTS (SELECT '1'
                  FROM Selected_Users as b
                  LEFT JOIN ThreadMembers as c
                         ON c.userFk = b.id
                        AND c.threadFk = a.id
                  WHERE c.userFk IS NULL)
This clause fixes that. It makes sure that there aren't any leftover users in the selection list, after excluding users for a particular thread.
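Taken together, the two NOT EXISTS clauses amount to a set-equality test: the first says the thread has no members outside the selection, the second says the selection has no users outside the thread. A small sketch of that reasoning with plain Python sets follows; the function name and the user names are made up for illustration.

```python
def matches_exactly(thread_members, selected_users):
    members, selected = set(thread_members), set(selected_users)
    # First NOT EXISTS: no thread member is absent from the selection.
    no_extra_members = not (members - selected)
    # Second NOT EXISTS: no selected user is absent from the thread.
    no_missing_users = not (selected - members)
    return no_extra_members and no_missing_users

selected = {"alice", "bob"}
exact = matches_exactly({"alice", "bob"}, selected)
extra_member = matches_exactly({"alice", "bob", "carol"}, selected)
missing_user = matches_exactly({"alice"}, selected)
same_count_other_users = matches_exactly({"alice", "carol"}, selected)
print(exact, extra_member, missing_user, same_count_other_users)
```

Only the first case is a match. The last case is the one a naive count comparison would get wrong: both sets have two users, but they are not the same two.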
The statement is now bugging me a bit - there may be a slightly better way for me to do this...
EDIT:
Muwahaha, there is a COUNT(*) version, which should also be faster:
WITH Selected_Users(id) as (VALUES (@id1), (@id2) /* , etc. */)
SELECT a.threadFk
FROM ThreadMembers as a
JOIN Selected_Users as b
  ON b.id = a.userFk
GROUP BY a.threadFk
HAVING COUNT(*) = (SELECT COUNT(*) FROM Selected_Users)
   AND COUNT(*) = (SELECT COUNT(*) FROM ThreadMembers as c
                   WHERE c.threadFk = a.threadFk)
Explanations:
SELECT a.threadFk
FROM ThreadMembers as a
JOIN Selected_Users as b
  ON b.id = a.userFk
This is joining to get all threads the listed members are a part of. This is the inside equivalent to the Threads CTE above. Actually, you could remove that CTE in the above query, too.
GROUP BY a.threadFk
We only want one instance of a given thread after all. Also (in DB2 at least), the rest of the statement isn't valid unless it's present.
HAVING COUNT(*) = (SELECT COUNT(*) FROM Selected_Users)
Verify that every selected user is a member of the given thread: the join keeps only rows for selected users, so the group's row count must equal the size of the selection.
   AND COUNT(*) = (SELECT COUNT(*) FROM ThreadMembers as c
                   WHERE c.threadFk = a.threadFk)
Verify that the given thread has no non-selected members: the group's row count must also equal the thread's total member count, so nobody is 'left out'.
You should get index-only access for this (I seem to be). The COUNT(*) of the result rows (for the GROUP BY) should only be performed once, and reused. The HAVING clause is evaluated after the GROUP BY takes place (if I recall correctly), so the sub-select for the count from the original table should only take place once per threadFk.
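The COUNT(*) comparisons rely on each (threadFk, userFk) pair appearing only once and on the selected ids being distinct; otherwise a duplicate row could make the counts agree by accident. The sketch below runs the COUNT(*) version against an in-memory SQLite database and checks it against plain set equality for every non-empty selection of a few users. SQLite, the sample threads, and the helper name `find_threads_count` are assumptions for illustration only.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ThreadMembers (
    threadFk INTEGER NOT NULL,
    userFk   INTEGER NOT NULL,
    PRIMARY KEY (threadFk, userFk)  -- no duplicate memberships
)""")
# Hypothetical sample data: thread id -> set of member ids
threads = {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3}, 4: {4}}
conn.executemany("INSERT INTO ThreadMembers VALUES (?, ?)",
                 [(t, u) for t, members in threads.items() for u in members])

def find_threads_count(conn, user_ids):
    ids = sorted(set(user_ids))  # distinct ids, so COUNT(*) counts users
    if not ids:
        return []
    values = ", ".join("(?)" for _ in ids)
    sql = f"""
        WITH Selected_Users(id) AS (VALUES {values})
        SELECT a.threadFk
        FROM ThreadMembers AS a
        JOIN Selected_Users AS b ON b.id = a.userFk
        GROUP BY a.threadFk
        HAVING COUNT(*) = (SELECT COUNT(*) FROM Selected_Users)
           AND COUNT(*) = (SELECT COUNT(*) FROM ThreadMembers AS c
                           WHERE c.threadFk = a.threadFk)
        ORDER BY a.threadFk
    """
    return [row[0] for row in conn.execute(sql, ids)]

# Compare with set equality for every non-empty selection of users 1..5.
mismatches = []
for size in range(1, 6):
    for selection in combinations(range(1, 6), size):
        expected = sorted(t for t, m in threads.items() if m == set(selection))
        if find_threads_count(conn, selection) != expected:
            mismatches.append(selection)
print("mismatches:", mismatches)
```

The primary key on (threadFk, userFk) both enforces the uniqueness assumption and gives the database an index that covers the query, which is presumably what the index-only access mentioned above depends on.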
Answer 2:
I don't recommend this lightly, but I think you'd be better off denormalizing slightly: add a column to Thread that contains a comma-separated, sorted list of foreign keys to User, and index that column. Then your application just has to sort the user IDs of the sender plus all recipients, join the sorted list with commas, and look up the Thread record.
Since, by definition, the list of users in a thread never changes, you just need to populate these values correctly on insert, and you don't have to worry about keeping them consistent through later updates.
(To be clear: what you describe is definitely possible with a properly normalized schema. But it will be ugly, and I think it will perform poorly.)
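A minimal sketch of the denormalized lookup, assuming integer user ids and an in-memory SQLite database. The `memberKey` column and the `member_key` and `find_or_create_thread` helpers are hypothetical names introduced here; the original schema has no such column.

```python
import sqlite3

def member_key(user_ids):
    """Canonical key: distinct user ids, sorted numerically, comma-joined."""
    return ",".join(str(u) for u in sorted(set(user_ids)))

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Thread (
    threadID  INTEGER PRIMARY KEY,
    lastSent  TEXT,
    memberKey TEXT NOT NULL UNIQUE  -- hypothetical column; UNIQUE also indexes it
)""")

def find_or_create_thread(conn, sender_id, recipient_ids):
    key = member_key([sender_id, *recipient_ids])
    row = conn.execute(
        "SELECT threadID FROM Thread WHERE memberKey = ?", (key,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO Thread (memberKey) VALUES (?)", (key,))
    return cur.lastrowid

first = find_or_create_thread(conn, 7, [3, 12])
again = find_or_create_thread(conn, 12, [7, 3])  # same people, different order
other = find_or_create_thread(conn, 7, [3])      # a different set of people
print(member_key([7, 3, 12]), first, again, other)
```

Sorting before joining is what makes the key independent of who sends and in what order the recipients are listed. A real implementation would also insert the ThreadMembers rows in the same transaction as the new Thread row.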
Answer 3:
Is it correct to say that you are interested in whether any thread exists that 1) has the same count in ThreadMembers, when grouped by threadFK, as the number of members in the group you are interested in, and 2) has a link to each member? If so, I think a solution will follow from there (so this is a proposed answer). The exact mechanics would vary with the brand of database you are using; Oracle, Postgres, or SQL Server would probably be simpler than other brands. How do you want to call the thing: as a stored procedure that takes a table of users or a list of user names and returns the key if there's a match, or NULL otherwise?
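The two criteria above can be checked in a single grouped pass, and the "key or NULL" calling convention maps naturally onto a function that returns an id or `None`. The following is only a sketch under assumptions: SQLite as the database, integer user ids, unique (threadFK, userFK) pairs, and a hypothetical function name `find_thread`. It is written as a client-side function rather than a stored procedure.

```python
import sqlite3
from typing import Iterable, Optional

def find_thread(conn: sqlite3.Connection, user_ids: Iterable[int]) -> Optional[int]:
    """Return the id of a thread whose members are exactly user_ids, else None."""
    ids = sorted(set(user_ids))
    if not ids:
        return None
    placeholders = ", ".join("?" for _ in ids)
    sql = f"""
        SELECT threadFK
        FROM ThreadMembers
        GROUP BY threadFK
        HAVING COUNT(*) = ?                         -- 1) same member count
           AND SUM(userFK IN ({placeholders})) = ?  -- 2) a link to each member
        LIMIT 1
    """
    row = conn.execute(sql, [len(ids), *ids, len(ids)]).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ThreadMembers (
    threadFK INTEGER NOT NULL,
    userFK   INTEGER NOT NULL,
    PRIMARY KEY (threadFK, userFK)
)""")
# Hypothetical sample data: thread 1 = {1, 2}, thread 2 = {1, 2, 3}
conn.executemany("INSERT INTO ThreadMembers VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)])

print(find_thread(conn, [2, 1]), find_thread(conn, [1, 3]))
```

This form groups every thread in the table before filtering, so on a large table it is likely to be slower than a query that first joins against the selected users. `SUM(userFK IN (...))` relies on SQLite treating the comparison result as 0 or 1; other databases may need a `CASE` expression instead.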
Answer 4:
Here's an example of the answer (answer number 1) using MS SQL Server 2008. It assumes that a table MessageThreadUsers (threadFK int, userFK varchar) is defined (your key types may be different):
DELETE FROM MessageThreadUsers
GO
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (1, 'user1')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (1, 'user2')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (2, 'user1')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (2, 'user2')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (2, 'user3')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (3, 'user1')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (3, 'user2')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (3, 'user3')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (3, 'user4')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (4, 'user1')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (4, 'user2')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (4, 'user3')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (4, 'user4')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (4, 'user5')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user1')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user2')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user3')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user4')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user5')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user6')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (6, 'user6')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (6, 'user3')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (6, 'user1')
GO
WITH Selected_Users (id) AS (
    SELECT 'user3' UNION
    SELECT 'user1' UNION
    SELECT 'user6'
)
SELECT a.threadFk
FROM MessageThreadUsers as a
JOIN Selected_Users as b
  ON b.id = a.userFk
GROUP BY a.threadFk
HAVING COUNT(*) = (SELECT COUNT(*) FROM Selected_Users)
   AND COUNT(*) = (SELECT COUNT(*) FROM MessageThreadUsers as c
                   WHERE c.threadFk = a.threadFk)
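To see what this example returns without a SQL Server instance, the same rows and query can be replayed against an in-memory SQLite database from Python. This is a sketch under that assumption; the helper name `threads_for` is made up, and the behavior on SQL Server itself is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MessageThreadUsers (threadFK INTEGER, userFK TEXT)")
# Same rows as the INSERT statements above: thread n holds user1..user(n+1)
# for n = 1..5, and thread 6 holds user6, user3 and user1.
rows = [(t, f"user{u}") for t in range(1, 6) for u in range(1, t + 2)]
rows += [(6, "user6"), (6, "user3"), (6, "user1")]
conn.executemany("INSERT INTO MessageThreadUsers VALUES (?, ?)", rows)

def threads_for(users):
    users = sorted(set(users))
    if not users:
        return []
    # Build the CTE body the same way as above: SELECT ? UNION SELECT ? ...
    selected = " UNION ".join("SELECT ?" for _ in users)
    sql = f"""
        WITH Selected_Users (id) AS ({selected})
        SELECT a.threadFK
        FROM MessageThreadUsers AS a
        JOIN Selected_Users AS b ON b.id = a.userFK
        GROUP BY a.threadFK
        HAVING COUNT(*) = (SELECT COUNT(*) FROM Selected_Users)
           AND COUNT(*) = (SELECT COUNT(*) FROM MessageThreadUsers AS c
                           WHERE c.threadFK = a.threadFK)
        ORDER BY a.threadFK
    """
    return [r[0] for r in conn.execute(sql, users)]

print(threads_for(["user3", "user1", "user6"]))
print(threads_for(["user1", "user2"]))
print(threads_for(["user1", "user6"]))
```

For the selection used above (user3, user1, user6), only thread 6 matches. Thread 5 also contains all three users but has three additional members, so the second COUNT(*) comparison rejects it.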
Source: https://stackoverflow.com/questions/9911003/sql-message-schema-need-to-find-an-existing-message-thread-given-a-set-of-us