Why does the SQL standard allow duplicate rows?

Question

One of the core rules for the relational model is the required uniqueness for tuples (rows):

Every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.

In a SQL world, that would mean that there could never exist two rows in a table for which all the column values were equal. If there were no meaningful way to guarantee uniqueness, a surrogate key could be added to the table.

When the first SQL standard was released, it defined no such restriction, and it has been like this ever since. This seems like the root of all kinds of evil.

Is there any meaningful reason why it was decided to be that way? In a practical world, where could an absence of such restriction prove to be useful? Does it outweigh the cons?

Answer 1:

You're assuming that databases are there solely for storing relational data; that's certainly not all they're used for, because practical considerations will always win.

An obvious example where there's no need for a primary key would be a "state" log of some description (weather/database/whatever). If you're never going to query a single value from this table, you may not want a primary key, so that inserts don't have to wait for the key to be updated. If you have a use case that picks up a single value from this table then sure, this would be a bad solution, but some people just don't need that. You can always add a surrogate key afterwards if it becomes absolutely necessary.

Another example would be a write-intensive application that needs to tell another process to do something. This secondary process runs every N minutes/hours/whatever. Doing the de-duplication on N million records as a one-off is quicker than checking for uniqueness on every insert into the table (trust me).
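The batch pattern described above could be sketched roughly as follows, using Python's standard `sqlite3` module with an in-memory database. The `work_queue` table and its columns are hypothetical names chosen for illustration, and the sketch only shows the shape of the approach; it does not measure the performance difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical queue table: no primary key and no unique constraint,
# so the writer never pays for a uniqueness check on insert.
conn.execute("CREATE TABLE work_queue (customer_id INTEGER, action TEXT)")

# The write-intensive application inserts freely, duplicates included.
rows = [(1, "reindex"), (2, "email"), (1, "reindex"), (1, "reindex"), (2, "email")]
conn.executemany("INSERT INTO work_queue VALUES (?, ?)", rows)

# The secondary process de-duplicates once per run instead of once per insert.
pending = conn.execute(
    "SELECT DISTINCT customer_id, action FROM work_queue "
    "ORDER BY customer_id, action"
).fetchall()

# Clear the queue after the batch has been picked up.
conn.execute("DELETE FROM work_queue")
conn.commit()
remaining = conn.execute("SELECT COUNT(*) FROM work_queue").fetchone()[0]
```

Here `pending` holds one entry per distinct request, regardless of how many times the writer inserted it.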

What are sold as relational databases are not being used solely as relational databases. They're being used as logs, key-value stores, graph databases etc. They may not have all the functionality of the competition but some do and it's often simpler to have a single table that doesn't fit your relational model than to create a whole other database and suffer the data-transfer performance penalties.

tl;dr People aren't mathematically perfect and so won't always use the mathematically perfect method of doing something. Committees are made up of people and can realise this, sometimes.

Answer 2:

The short answer is that SQL is not relational and SQL DBMSs are not relational DBMSs.

Duplicate rows are a fundamental part of the SQL model of data because the SQL language doesn't really try to implement the relational algebra. SQL uses a bag (multiset)-based algebra instead. The results of queries and other operations in relational algebra are relations that always have distinct tuples, but SQL DBMSs don't have the luxury of dealing only with relations. Given this fundamental "feature" of the SQL language, SQL database engines need to have mechanisms for processing and storing duplicate rows.
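A small illustration of the bag semantics, again as a sketch with Python's `sqlite3` and a hypothetical `orders` table. Note that the table itself contains no duplicate rows; the duplicates appear as soon as a query projects onto a subset of the columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table; every stored row is distinct.
conn.execute("CREATE TABLE orders (customer TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", "Oslo"), ("bob", "Oslo"), ("carol", "Rome")],
)

# Projection onto one column: SQL returns one output row per input row (a bag).
bag = conn.execute("SELECT city FROM orders ORDER BY city").fetchall()

# Set semantics have to be requested explicitly with DISTINCT.
as_set = conn.execute("SELECT DISTINCT city FROM orders ORDER BY city").fetchall()

# UNION ALL keeps duplicates; UNION removes them.
union_all = conn.execute(
    "SELECT city FROM orders UNION ALL SELECT city FROM orders"
).fetchall()
union = conn.execute(
    "SELECT city FROM orders UNION SELECT city FROM orders"
).fetchall()
```

In relational algebra the projection would yield just the two distinct cities; in SQL that is what `as_set` contains, while `bag` keeps "Oslo" twice.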

Why was SQL designed that way? One reason seems to be that the relational model was just too big a leap of faith to make at that time. The relational model was an idea well ahead of its time. SQL, on the other hand, was and remains very much rooted in the systems of three decades ago.

Answer 3:

The very first versions of the language did not have any form of constraints, including keys. So uniqueness could simply not be enforced. When support for constraints (keys in particular) was later added to the language, operational systems had already been written, and nobody wanted to break backward compatibility. So it (allowing duplicates) has been there ever since.

Many neat little topics of historical background, just like this one, can be found in Hugh Darwen's book "SQL: A Comparative Survey" (freely available from bookboon).

(EDIT: presumably the reason there was no support for constraints in the very first versions of the language was that, at the time, Codd's main vision was that the query language would effectively be a query (i.e. read-only) language, and that the "relational" aspect of the DBMS would be limited to having a "relational wrapper layer" over existing databases which were not relational in structure. From that perspective, there is no question of "updating" in the language itself, hence no need to define constraints, because those were defined and enforced in the "existing, non-relational database". But that approach was abandoned pretty early on.)

Answer 4:

Although that is normally how tables work, it's not practical to have it as a rule.

To follow the rule, a table must always have a primary key. That means that you can't just remove the primary key on a table and then add a different one. You would need to make both changes at once, so that the table is never without a primary key.

Answer 5:

"In a SQL world, that would mean that there could never exist two rows in a table for which all the column values were equal", and that's true. Unless all the attributes of a tuple match those of another, it's not a duplicate, even if the two differ only in the primary key column.

That's why we should define another key (a unique key) on other column(s) alongside the primary key, so that each record is identified as unique.
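To make the point concrete, here is a minimal sketch using Python's `sqlite3`. The `person_a` and `person_b` tables and their columns are hypothetical. With only a surrogate key, two rows that agree on every meaningful column are both accepted; adding a unique key on the natural identifier rejects the second one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Surrogate key only: rows identical in every "real" column are accepted,
# because the generated id makes them technically distinct.
conn.execute("CREATE TABLE person_a (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.execute("INSERT INTO person_a (email, name) VALUES ('a@example.com', 'Ann')")
conn.execute("INSERT INTO person_a (email, name) VALUES ('a@example.com', 'Ann')")
count_a = conn.execute("SELECT COUNT(*) FROM person_a").fetchone()[0]

# Surrogate key plus a unique key on the natural identifier (email).
conn.execute(
    "CREATE TABLE person_b (id INTEGER PRIMARY KEY, email TEXT UNIQUE, name TEXT)"
)
conn.execute("INSERT INTO person_b (email, name) VALUES ('a@example.com', 'Ann')")
try:
    conn.execute("INSERT INTO person_b (email, name) VALUES ('a@example.com', 'Ann')")
    rejected = False
except sqlite3.IntegrityError:
    # The UNIQUE constraint blocks the logically duplicate row.
    rejected = True
count_b = conn.execute("SELECT COUNT(*) FROM person_b").fetchone()[0]
```

Which column(s) deserve the unique key depends on the data; `email` is only an assumed example of a natural identifier.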

Source: https://stackoverflow.com/questions/30767562/why-does-sql-standard-allow-duplicate-rows

Tags

sql

relational-database