问题
This question originates from a discussion on whether to use SQL ranking functionality or not in a particular case.
Any common RDBMS includes some ranking functionality, i.e. it's query language has elements like TOP n ... ORDER BY key
, ROW_NUMBER() OVER (ORDER BY key)
, or ORDER BY key LIMIT n
(overview).
They do a great job in increasing performance if you want to present only a small chunk out of a huge number of records. But they also introduce a major pitfall: If key
is not unique results are non-deterministic. Consider the following example:
users
user_id name
1 John
2 Paul
3 George
4 Ringo
logins
login_id user_id login_date
1 4 2009-08-17
2 1 2009-08-18
3 2 2009-08-19
4 3 2009-08-20
A query is supposed to return the person who logged in last:
SELECT TOP 1 users.*
FROM
logins JOIN
users ON logins.user_id = users.user_id
ORDER BY logins.login_date DESC
Just as expected George
is returned and everything looks fine. But then a new record is inserted into logins
table:
1 4 2009-08-17
2 1 2009-08-18
3 2 2009-08-19
4 3 2009-08-20
5 4 2009-08-20
What does the query above return now? Ringo
? George
? You can't tell. As far as I remember e.g. MySQL 4.1 returns the first record physically created that matches the criteria, i.e. the result would be George
. But this may vary from version to version and from DBMS to DBMS. What should have been returned? One might say Ringo
since he apparently logged in last but this is pure interpretation. In my opinion both should have been returned, because you can't decide unambiguously from the data available.
So this query matches the requirements:
SELECT users.*
FROM
logins JOIN
users ON
logins.user_id = users.user_id AND
logins.login_date = (
SELECT max(logins.login_date)
FROM
logins JOIN
users ON logins.user_id = users.user_id)
As an alternative some DBMSs provide special functions (e.g. Microsoft SQL Server 2005 introduces TOP n WITH TIES ... ORDER BY key
(suggested by gbn), RANK
, and DENSE_RANK
for this very purpose).
If you search SO for e.g. ROW_NUMBER
you'll find numerous solutions which suggest using ranking functionality and miss to point out the possible problems.
Question: What advice should be given if a solution that includes ranking functionality is proposed?
回答1:
rank
and row_number
are fantastic functions that should be used more liberally, IMO. Folks just don't know about them.
That being said, you need to make sure what you're ranking by is unique. Have a backup plan for duplicates (esp. dates). The data you get back is only as good as the data you put in.
I think the pitfalls here are the exact same in the query:
select top 2 * from tblA order by date desc
You need to be aware of what you're ordering on and ensure that there is some way to always have a winner. If not, you get a (potentially) random two rows with the max date.
Also, for the record, SQL Server does not store rows in the physical order that they are inserted. It stores records on 8k pages and orders those pages in the most efficient way it can according to the clustered index on the table. Therefore, there is absolutely no guarantee of order in SQL Server.
回答2:
Use the WITH TIES clause in your example above
SELECT TOP 1 WITH TIES users.*
FROM
logins JOIN
users ON logins.user_id = users.user_id
ORDER BY logins.login_date DESC
Use DENSE_RANK as you mentioned
Not put myself in this position Example: Store time too (datetime) and accept the very low risk of a very rare duplicate in the same 3.33 millisecond instant (SQL 2008 is different)
回答3:
Every database engine uses some kind of a row identifier so that it can distinguish between two rows.
These identifiers are:
- Row pointer in
MyISAM
- Primary key in
InnoDB
table with aPRIMARY KEY
defined Uniquifier
inInnoDB
table without aPRIMARY KEY
definedRID
inSQL Server
's heap table- Primary key in
SQL Server
's table clustered onPRIMARY/UNIQUE KEY
- Index key +
uniquifier
inSQL Server
's table clustered on a non-unique key ROWID
/UROWID
inOracle
CTID
inPostgreSQL
.
You don't have an immediate access to the following ones:
- Row pointer in
MyISAM
Uniquifier
inInnoDB
table without aPRIMARY KEY
definedRID
inSQL Server
's heap table- Index key +
uniquifier
inSQL Server
's table clustered on a non-unique key
Besides, you don't have control over the following ones:
ROWID
/UROWID
inOracle
CTID
inPostgreSQL
.
(they can change on updates or restoring from backups)
If two rows are identical in these tables, that means they should be identical from the application's point of view.
They return exactly same results and can be treated as an ultimate uniquifier.
This just means you should always include some kind of a uniquifier you have full control over to the ordering clause to keep your ordering consistent.
If your table has a primary or unique key (even composite), include it into the ordering condition:
SELECT *
FROM mytable
ORDER BY
ordering_column, pk
Otherwise, include all columns into the ordering condition:
SELECT *
FROM mytable
ORDER BY
ordering_column, column1, ..., columnN
The later condition will always return any of the otherwise indistinguishable rows, but since they're indistinguishable anyway, it will look consistent from your applications's point of view.
That, by the way, is another good reason for always having a PRIMARY KEY
in your tables.
But do not rely on ROWID
/ CTID
to order rows.
It can easily change on UPDATE
so your result order will not be stable anymore.
回答4:
ROW_NUMBER is a fantastic tool indeed. If misused it can provide non-deterministic results, but so will the other SQL functions. You can have ORDER BY return non-deterministic results as well.
Just know what you are doing.
回答5:
This is the summary:
- Use your head first. Should be obvious, but it is always a good point to start. Do you expect
n
rows exactly or do you expect a possibly varying number of rows that fulfill a constraint? Reconsider your design. If you're expectingn
rows exactly, your model might be designed poorly if it's impossible to identify a row unambiguously. If you expect a possibly varying number of rows, you might need to adjust your UI in order to present your query results. - Add columns to
key
that make it unique (e.g. PK). You at least gain back control on the returned result. There is almost always a way to do this as Quassnoi pointed out. - Consider using possibly more suitable functions like
RANK
,DENSE_RANK
andTOP n WITH TIES
. They are available in Microsoft SQL Server by 2005 version and in PosgreSQL from 8.4 onwards. If these functions are not available, consider using nested queries with aggregation instead of ranking functions.
来源:https://stackoverflow.com/questions/1305325/should-sql-ranking-functionality-be-considered-as-use-with-caution