Limit number of rows per group from join (NOT to 1 row)

Question

Given these tables:

TABLE Stores (
 store_id INT,
 store_name VARCHAR,
 etc
);

TABLE Employees (
 employee_id INT,
 store_id INT,
 employee_name VARCHAR,
 currently_employed BOOLEAN,
 etc
);

I want to list the 15 longest-employed employees for each store (let's say the 15 with the lowest employee_id), or ALL employees for a store if there are fewer than 15 who are currently_employed='t'. I want to do it with a join clause.

I've found a lot of examples of people doing this for only 1 row, usually a min or max (the single longest-employed employee), but I basically want to combine an ORDER BY and a LIMIT inside of the join. Some of those examples can be found here:

Limit results from joined table to one row
MySQL returning 1 image for each product

I've also found decent examples for doing this store-by-store (which I don't want to do; I have about 5000 stores):

Get top n records for each group of grouped results

I've also seen that you can use TOP instead of ORDER BY and LIMIT, but not for PostgreSQL.

I reckon that a join clause between the two tables isn't the only (or even necessarily the best) way to do this, if it's possible to just work by distinct store_id inside of the employees table, so I'd be open to other approaches. I can always join afterwards.

As I'm very new to SQL, I'd like any theory background or additional explanation that can help me understand the principles at work.

Answer 1:

row_number()

The general solution to get the top n rows per group is with the window function row_number():

SELECT *
FROM  (
   SELECT *, row_number() OVER (PARTITION BY store_id ORDER BY employee_id) AS rn
   FROM   employees
   WHERE  currently_employed
   ) e
JOIN   stores s USING (store_id)
WHERE  rn <= 15
ORDER  BY store_id, e.rn;

PARTITION BY should use store_id, which is guaranteed to be unique (as opposed to store_name).
First identify rows in employees, then join to stores; that's cheaper.
To get exactly 15 rows, use row_number(), not rank(), which would be the wrong tool here because tied rows share a rank and can push the result past 15. As long as employee_id is unique, the difference doesn't show.
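
To see where the two functions diverge, here is a minimal sketch that orders by a column that can contain ties. The hire_year column and the sample rows are assumptions made up for illustration; they are not part of the question's schema.

```sql
-- Hypothetical data: hire_year is NOT in the question's schema. It is used
-- only because, unlike employee_id, it can contain ties.
WITH emp (employee_id, store_id, hire_year) AS (
   VALUES (1, 1, 2010)
        , (2, 1, 2011)
        , (3, 1, 2011)   -- same hire_year as employee 2: a tie
        , (4, 1, 2012)
   )
SELECT employee_id, hire_year
     , row_number() OVER w AS rn   -- always distinct numbers per partition
     , rank()       OVER w AS rk   -- tied rows share a rank, then a gap follows
FROM   emp
WINDOW w AS (PARTITION BY store_id ORDER BY hire_year)
ORDER  BY employee_id;
```

rank() gives both 2011 hires the same rank, so rk comes out as 1, 2, 2, 4, and a filter like rk <= 2 would return three rows. row_number() numbers the rows 1 through 4 (breaking the tie arbitrarily), so rn <= 2 returns exactly two.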

LATERAL

An alternative for Postgres 9.3+ that typically performs better in combination with a matching index, especially when retrieving a small selection from a big table.

What is the difference between LATERAL and a subquery in PostgreSQL?

SELECT s.store_name, e.*
FROM   stores s
, LATERAL (
   SELECT *  -- or just needed columns
   FROM   employees
   WHERE  store_id = s.store_id
   AND    currently_employed
   ORDER  BY employee_id
   LIMIT  15
   ) e
-- WHERE ... possibly select only a few stores
ORDER  BY s.store_name, e.store_id, e.employee_id;

The perfect index would be a partial multicolumn index like this:

CREATE INDEX ON employees (store_id, employee_id) WHERE currently_employed;

The best index depends on details missing from the question. Related example:

Create unique constraint with null columns

Both versions exclude stores without current employees. There are ways around this if you need it ...
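
One such way, sketched here for the LATERAL version, is to replace the implicit cross join with LEFT JOIN LATERAL ... ON true, so that a store with no matching employees still yields one row with NULLs in the employee columns. The temporary tables and sample rows below are made up for illustration only.

```sql
-- Sample data for illustration only (not from the question).
CREATE TEMP TABLE stores (store_id int, store_name varchar);
CREATE TEMP TABLE employees (
   employee_id int, store_id int, employee_name varchar, currently_employed boolean
);
INSERT INTO stores    VALUES (1, 'North'), (2, 'South');   -- 'South' has no employees
INSERT INTO employees VALUES (1, 1, 'Ann', true), (2, 1, 'Bob', false);

SELECT s.store_name, e.*
FROM   stores s
LEFT   JOIN LATERAL (
   SELECT *
   FROM   employees
   WHERE  store_id = s.store_id
   AND    currently_employed
   ORDER  BY employee_id
   LIMIT  15
   ) e ON true              -- keep every store, matched or not
ORDER  BY s.store_name, e.employee_id;
```

With this sample data, 'North' appears with its one current employee and 'South' appears once with NULL employee columns. The row_number() version can be adapted along the same lines by selecting from stores and using a LEFT JOIN to the subquery, with rn <= 15 moved into the join condition.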

Answer 2:

A classic way of doing this would be with a window function, such as rank:

SELECT employee_name, store_name
FROM   (SELECT employee_name, store_name, 
        RANK() OVER (PARTITION BY store_name ORDER BY employee_id ASC) AS rk
        FROM   employees e
        JOIN   stores s ON e.store_id = s.store_id) t
WHERE  rk <= 15

Source: https://stackoverflow.com/questions/30768144/limit-number-of-rows-per-group-from-join-not-to-1-row

Tags

sql

postgresql

join

greatest-n-per-group

sql-limit