Group by repeating attribute

≯℡__Kan透↙ posted on 2019-12-17 21:16:26

Question


I have a table messages with a user_id field that identifies the user who created each message.

When I display a conversation (a set of messages) between two users, I want to group the messages by user_id, but in a tricky way:

Let's say there are some messages (sorted by created_at desc):

  id: 1, user_id: 1
  id: 2, user_id: 1
  id: 3, user_id: 2
  id: 4, user_id: 2
  id: 5, user_id: 1

I want to get 3 message groups in the below order: [1,2], [3,4], [5]

It should keep grouping by user_id until it sees a different one, then start a new group for that one.

I'm using PostgreSQL and would be happy to use something specific to it, whatever would give the best performance.


Answer 1:


Proper SQL

@Igor presents a nice pure-SQL technique with window functions.
However, the question asks:

"I want to get 3 message groups in the below order: [1,2], [3,4], [5]"

To get the requested order, add ORDER BY min(id):

SELECT array_agg(id) AS ids
FROM (
   SELECT id
         ,user_id
         ,row_number() OVER (ORDER BY id) -
          row_number() OVER (PARTITION BY user_id ORDER BY id) AS grp
   FROM   messages
   ORDER  BY id) t   -- for ordered arrays in result
GROUP  BY grp, user_id
ORDER  BY min(id);

SQL Fiddle.

The addition would barely warrant another answer. The more important issue is this:

Faster with PL/pgSQL

From the question: "I'm using PostgreSQL and would be happy to use something specific to it, whatever would give the best performance."

Pure SQL is all nice and shiny, but a procedural server-side function is much faster for this task. While processing rows procedurally is generally slower, PL/pgSQL wins this competition by a wide margin because it can make do with a single table scan and a single ORDER BY operation:

CREATE OR REPLACE FUNCTION f_msg_groups()
  RETURNS TABLE (ids int[]) AS
$func$
DECLARE
   _id    int;
   _uid   int;
   _id0   int;                         -- id of last row
   _uid0  int;                         -- user_id of last row
BEGIN
   FOR _id, _uid IN
       SELECT id, user_id FROM messages ORDER BY id
   LOOP
       IF _uid <> _uid0 THEN
          RETURN QUERY VALUES (ids);   -- output row (never happens after 1 row)
          ids := ARRAY[_id];           -- start new array
       ELSE
          ids := ids || _id;           -- add to array
       END IF;

       _id0  := _id;
       _uid0 := _uid;                  -- remember last row
   END LOOP;

   RETURN QUERY VALUES (ids);          -- output last iteration
END
$func$ LANGUAGE plpgsql;

Call:

SELECT * FROM f_msg_groups();

Benchmark and links

I ran a quick test with EXPLAIN ANALYZE on a similar real-life table with 60k rows (executed several times, picking the fastest result to exclude caching effects):

SQL:
Total runtime: 1009.549 ms
PL/pgSQL:
Total runtime: 336.971 ms

Also consider these closely related questions:

  • GROUP BY and aggregate sequential numeric values
  • GROUP BY consecutive dates delimited by gaps
  • Ordered count of consecutive repeats / duplicates



Answer 2:


Try something like this:

SELECT user_id, array_agg(id)
FROM (
SELECT id, 
       user_id, 
       row_number() OVER (ORDER BY created_at)-
       row_number() OVER (PARTITION BY user_id ORDER BY created_at) conv_id
FROM table1 ) t
GROUP BY user_id, conv_id;

The expression:

row_number() OVER (ORDER BY created_at)-
row_number() OVER (PARTITION BY user_id ORDER BY created_at) conv_id

This expression yields an identifier for every message group. The same conv_id value can recur for a different user_id, but each (user_id, conv_id) pair identifies exactly one message group.
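
To see why the difference of the two row numbers stays constant within a run, here is a minimal Ruby sketch that mimics the two row_number() calls on the example data from the question. The Row struct and the in-memory rows array are assumptions standing in for the real table; this is an illustration of the idea, not how PostgreSQL evaluates the query.

```ruby
# Hypothetical in-memory stand-in for the table, already in display order.
Row = Struct.new(:id, :user_id)
rows = [Row.new(1, 1), Row.new(2, 1), Row.new(3, 2), Row.new(4, 2), Row.new(5, 1)]

per_user = Hash.new(0)           # running row number inside each user_id partition
labelled = rows.each_with_index.map do |row, i|
  overall = i + 1                # like row_number() OVER (ORDER BY ...)
  per_user[row.user_id] += 1     # like row_number() OVER (PARTITION BY user_id ORDER BY ...)
  [row.id, row.user_id, overall - per_user[row.user_id]]
end

# Group on the (user_id, difference) pair, as the outer GROUP BY does,
# then order the groups by their smallest id.
groups = labelled.group_by { |_id, uid, diff| [uid, diff] }
                 .values
                 .map { |g| g.map(&:first) }
                 .sort_by(&:min)

p labelled
p groups
```

In this data the difference is 2 both for ids 3 and 4 (user 2) and for id 5 (user 1), which is why the difference alone is not enough and user_id has to be part of the grouping key.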

My SQLFiddle with example.

Details: row_number(), OVER (PARTITION BY ... ORDER BY ...)




Answer 3:


A plain GROUP BY user_id will collapse the result into 2 records, one with user_id 1 and one with user_id 2, regardless of the ORDER BY clause. So I recommend sending just the ORDER BY created_at and detecting the group boundaries in application code:

prev_id = -1   # sentinel, assuming no real user_id is -1
messages.each do |m|
  if m.user_id != prev_id   # the sender changed, so a new group starts here
    prev_id = m.user_id
    # do whatever you want with a new message group
  end
end
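
The loop above only marks where a group starts. As a rough sketch of how the same pass could also collect the groups, assuming the messages are already ordered, something like the following may work. The Entry struct and the consecutive_groups helper are hypothetical names introduced for illustration only.

```ruby
Entry = Struct.new(:id, :user_id)   # hypothetical stand-in for the real model

# Split an already-ordered list into runs of consecutive messages
# written by the same user, in a single pass.
def consecutive_groups(messages)
  groups = []
  prev_id = nil
  messages.each do |m|
    if groups.empty? || m.user_id != prev_id
      groups << []                  # sender changed: start a new group
      prev_id = m.user_id
    end
    groups.last << m                # append to the current group
  end
  groups
end

entries = [[1, 1], [2, 1], [3, 2], [4, 2], [5, 1]].map { |id, uid| Entry.new(id, uid) }
ids = consecutive_groups(entries).map { |g| g.map(&:id) }
p ids
```

Checking groups.empty? first avoids relying on a sentinel value for the previous user_id.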



Answer 4:


You can use chunk:

Message = Struct.new :id, :user_id

messages = []
messages << Message.new(1, 1)
messages << Message.new(2, 1)
messages << Message.new(3, 2)
messages << Message.new(4, 2)
messages << Message.new(5, 1)

messages.chunk(&:user_id).each do |user_id, records| 
  p "#{user_id} - #{records.inspect}" 
end

The output:

"1 - [#<struct Message id=1, user_id=1>, #<struct Message id=2, user_id=1>]"
"2 - [#<struct Message id=3, user_id=2>, #<struct Message id=4, user_id=2>]"
"1 - [#<struct Message id=5, user_id=1>]"
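
If only the arrays of ids from the question are needed, a variation on the same idea is Enumerable#chunk_while (available since Ruby 2.3), which keeps adjacent elements together while the block returns true. This is a sketch on the same sample data; the Msg struct is an assumed stand-in for the real model.

```ruby
Msg = Struct.new(:id, :user_id)     # hypothetical stand-in for the real model
msgs = [[1, 1], [2, 1], [3, 2], [4, 2], [5, 1]].map { |id, uid| Msg.new(id, uid) }

# Keep neighbouring messages in one run while the sender stays the same,
# then reduce each run to its ids.
id_groups = msgs.chunk_while { |a, b| a.user_id == b.user_id }
                .map { |run| run.map(&:id) }

p id_groups
```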


Source: https://stackoverflow.com/questions/14010348/group-by-repeating-attribute
