PostgreSQL where all in array

情书的邮戳 2020-11-30 07:23

What is the easiest and fastest way to achieve a clause where all elements in an array must be matched - not only one when using IN? After all it should behave

9 answers
  • 2020-11-30 08:18

    While @Alex' answer with IN and count() is probably the simplest solution, I expect this PL/pgSQL function to be faster:

    CREATE OR REPLACE FUNCTION f_conversations_among_users(_user_arr int[])
      RETURNS SETOF conversations AS
    $BODY$
    DECLARE
        _sql text := '
        SELECT c.*
        FROM   conversations c';
        i int;
    BEGIN
    
    FOREACH i IN ARRAY _user_arr LOOP
        _sql  := _sql  || '
        JOIN   conversations_users x' || i || ' USING (conversation_id)';
    END LOOP;
    
    _sql  := _sql  || '
        WHERE  TRUE';
    
    FOREACH i IN ARRAY _user_arr LOOP
        _sql  := _sql  || '
        AND    x' || i || '.user_id = ' || i;
    END LOOP;
    
    /* uncomment for conversations with exact list of users and no more
    _sql  := _sql  || '
        AND    NOT EXISTS (
            SELECT 1
            FROM   conversations_users u
            WHERE  u.conversation_id = c.conversation_id
            AND    u.user_id <> ALL ($1)
            )';
    */
    
    -- RAISE NOTICE '%', _sql;
    -- _user_arr is passed as $1; only the optional clause above references it
    RETURN QUERY EXECUTE _sql USING _user_arr;
    
    END;
    $BODY$ LANGUAGE plpgsql VOLATILE;
    

    Call:

    SELECT * FROM f_conversations_among_users('{1,2}')
    

    The function dynamically builds and executes a query of the form:

    SELECT c.*
    FROM   conversations c
    JOIN   conversations_users x1 USING (conversation_id)
    JOIN   conversations_users x2 USING (conversation_id)
    ...
    WHERE  TRUE
    AND    x1.user_id = 1
    AND    x2.user_id = 2
    ...
    

    This form performed best in an extensive test of queries for relational division.

    You could also build the query in your app, but I went by the assumption that you want to use one array parameter. Also, this is probably fastest anyway.

    Either query requires an index like the following to be fast:

    CREATE INDEX conversations_users_user_id_idx ON conversations_users (user_id);
    

    A multi-column primary (or unique) key on (user_id, conversation_id) serves just as well, but one on (conversation_id, user_id) (which you may very well have!) would be inferior. You can find a short rationale at the link above, or a comprehensive assessment under this related question on dba.SE.

    I also assume you have a primary key on conversations.conversation_id.
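
    For reference, the assumptions above could look like the following minimal sketch. Only the table and column names are taken from the queries; the column types, the foreign key and the choice of a composite primary key are hypothetical:

    ```sql
    -- Hypothetical minimal schema matching the assumptions stated above.
    CREATE TABLE conversations (
        conversation_id integer PRIMARY KEY
    );

    CREATE TABLE conversations_users (
        user_id         integer NOT NULL,
        conversation_id integer NOT NULL REFERENCES conversations,
        -- user_id comes first so that lookups by user_id can use this index
        PRIMARY KEY (user_id, conversation_id)
    );
    ```

    With a key in this column order, the separate index on user_id shown earlier should not be needed.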

    Can you run a performance test with EXPLAIN ANALYZE on @Alex' query and this function and report your findings?
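
    One caveat for such a test: EXPLAIN ANALYZE on the function call reports only a function scan with its total time, not the plan of the dynamic query inside. A rough way to see the joins, assuming you capture the generated SQL (for example by uncommenting the RAISE NOTICE line), is to run that statement directly:

    ```sql
    -- Reports a single Function Scan node; the inner plan is hidden
    EXPLAIN ANALYZE
    SELECT * FROM f_conversations_among_users('{1,2}');

    -- The statement the function generates for '{1,2}', run on its own
    EXPLAIN ANALYZE
    SELECT c.*
    FROM   conversations c
    JOIN   conversations_users x1 USING (conversation_id)
    JOIN   conversations_users x2 USING (conversation_id)
    WHERE  TRUE
    AND    x1.user_id = 1
    AND    x2.user_id = 2;
    ```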

    Note that both solutions find conversations where at least the users in the array take part - including conversations with additional users.
    If you want to exclude those, un-comment the additional clause in my function (or add it to any other query).

    Tell me if you need more explanation on the features of the function.

  • 2020-11-30 08:20

    I am guessing that you don't really want to start messing with temporary tables.

    Your question was unclear as to whether you want conversations with exactly the set of users, or conversations with a superset. The following is for the superset:

    with users as (select user_id from users where user_id in (<list>)
                  ),
         conv  as (select conversation_id, user_id
                   from conversations_users
                   where user_id in (<list>)
                  )
    select distinct conversation_id
    from users u left outer join
         conv c
         on u.user_id = c.user_id
    where c.conversation_id is not null
    

    This query performs well only if you have indexes on user_id in both users and conversations_users.

    For the exact set:

    with users as (select user_id from users where user_id in (<list>)
                  ),
         conv  as (select conversation_id, user_id
                   from conversations_users
                   where user_id in (<list>)
                  )
    select distinct conversation_id
    from users u full outer join
         conv c
         on u.user_id = c.user_id
    where c.conversation_id is not null and u.user_id is not null
    
  • 2020-11-30 08:24

    Assuming the join table follows good practice and has a unique compound key defined, i.e. a constraint to prevent duplicate rows, then something like the following simple query should do.

    select conversation_id from conversations_users where user_id in (1, 2)
    group by conversation_id having count(*) = 2
    

    Note that the number 2 at the end is the length of the list of user_ids, so it must change whenever the list changes length. If the join table may contain duplicate rows, change "count(*)" to "count(distinct user_id)", at some possible cost in performance.

    This query finds all conversations that include all the specified users even if the conversation also includes additional users.
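
    In PostgreSQL the hard-coded count can be avoided by passing the user ids as one array, which also matches the array parameter in the question. The following is only a sketch: the statement name conversations_with_all is hypothetical, and it assumes the array itself contains no duplicate ids (otherwise cardinality() would overcount).

    ```sql
    -- Hypothetical prepared statement; $1 is an int[] of user ids without duplicates.
    PREPARE conversations_with_all(int[]) AS
    SELECT conversation_id
    FROM   conversations_users
    WHERE  user_id = ANY ($1)
    GROUP  BY conversation_id
    HAVING count(DISTINCT user_id) = cardinality($1);  -- cardinality() needs PostgreSQL 9.4+

    EXECUTE conversations_with_all('{1,2}');
    ```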

    If you want only conversations with exactly the specified set of users, one approach is to use a nested subquery in the where clause as below. The first and last lines are the same as in the original query; only the middle two lines are new.

    select conversation_id from conversations_users where user_id in (1, 2)
       and conversation_id not in
       (select conversation_id from conversations_users where user_id not in (1,2))
    group by conversation_id having count(*) = 2
    

    Equivalently, you can use a set difference operator if your database supports it. Here is an example in Oracle syntax. (For Postgres or DB2, change the keyword "minus" to "except".)

    select conversation_id from conversations_users where user_id in (1, 2)
      group by conversation_id having count(*) = 2
    minus
      select conversation_id from conversations_users where user_id not in (1,2)
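
    For PostgreSQL, the same query with the keyword swapped would read roughly as follows (a direct transcription of the Oracle example, not separately benchmarked):

    ```sql
    -- Conversations containing users 1 and 2 ...
    SELECT conversation_id FROM conversations_users WHERE user_id IN (1, 2)
    GROUP  BY conversation_id HAVING count(*) = 2
    EXCEPT
    -- ... minus conversations that also contain anyone else
    SELECT conversation_id FROM conversations_users WHERE user_id NOT IN (1, 2);
    ```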
    

    A good query optimizer should treat the last two variations identically, but check with your particular database to be sure. For example, the Oracle 11GR2 query plan sorts the two sets of conversation ids before applying the minus operator, but skips the sort step for the last query. So either query plan could be faster depending on multiple factors such as the number of rows, cores, cache, indices etc.
