Safely normalizing data via SQL query

Suppose I have a table of customers:

CREATE TABLE customers (
    customer_number  INTEGER,
    customer_name    VARCHAR(...),
    customer_address VARCHAR(...)
)

This table does not have a primary key. However, customer_name and customer_address should be unique for any given customer_number.

It is not uncommon for this table to contain many duplicate customers. To get around this duplication, the following query is used to isolate only the unique customers:

SELECT
  DISTINCT customer_number, customer_name, customer_address
FROM customers

Fortunately, the table has traditionally contained accurate data. That is, there has never been a conflicting customer_name or customer_address for any customer_number. However, suppose conflicting data did make it into the table. I wish to write a query that will fail, rather than returning multiple rows for the customer_number in question.

For example, I tried this query with no success:

SELECT
  customer_number, DISTINCT(customer_name, customer_address)
FROM customers
GROUP BY customer_number

Is there a way to write such a query using standard SQL? If not, is there a solution in Oracle-specific SQL?

EDIT: The rationale behind the bizarre query:

Truth be told, this customers table does not actually exist (thank goodness). I created it hoping that it would be clear enough to demonstrate the needs of the query. However, people are (fortunately) catching on that the need for such a query is the least of my worries, based on that example. Therefore, I must now peel away some of the abstraction and hopefully restore my reputation for suggesting such an abomination of a table...

I receive a flat file containing invoices (one per line) from an external system. I read this file, line-by-line, inserting its fields into this table:

CREATE TABLE unprocessed_invoices (
    invoice_number   INTEGER,
    invoice_date     DATE,
    ...
    // other invoice columns
    ...
    customer_number  INTEGER,
    customer_name    VARCHAR(...),
    customer_address VARCHAR(...)
)

As you can see, the data arriving from the external system is denormalized. That is, the external system includes both the invoice data and its associated customer data on the same line. It is possible that multiple invoices will share the same customer, therefore it is possible to have duplicate customer data.

The system cannot begin processing the invoices until all customers are guaranteed to be registered with the system. Therefore, the system must identify the unique customers and register them as necessary. This is why I wanted the query: because I was working with denormalized data I had no control over.

SELECT
  customer_number, DISTINCT(customer_name, customer_address)
FROM unprocessed_invoices
GROUP BY customer_number

Hopefully this helps clarify the original intent of the question.

EDIT: Examples of good/bad data

To clarify: customer_name and customer_address only have to be unique for a particular customer_number.

 customer_number | customer_name | customer_address
----------------------------------------------------
 1               | 'Bob'         | '123 Street'
 1               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'
 3               | 'Fred'        | '456 Avenue'
 3               | 'Fred'        | '789 Crescent'

The first two rows are fine because it is the same customer_name and customer_address for customer_number 1.

The middle two rows are fine because it is the same customer_name and customer_address for customer_number 2 (even though another customer_number has the same customer_name and customer_address).

The last two rows are not okay because there are two different customer_addresses for customer_number 3.

The query I am looking for would fail if run against all six of these rows. However, if only the first four rows actually existed, the view should return:

 customer_number | customer_name | customer_address
----------------------------------------------------
 1               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'

I hope this clarifies what I meant by "conflicting customer_name and customer_address". They have to be unique per customer_number.

I appreciate those that are explaining how to properly import data from external systems. In fact, I am already doing most of that already. I purposely hid all the details of what I'm doing so that it would be easier to focus on the question at hand. This query is not meant to be the only form of validation. I just thought it would make a nice finishing touch (a last defense, so to speak). This question was simply designed to investigate just what was possible with SQL. :)

A scalar sub-query must only return one row (per result set row...) so you could do something like:

select distinct
       customer_number,
       (
       select distinct
              customer_address
         from customers c2
        where c2.customer_number = c.customer_number
       ) as customer_address
  from customers c

Your approach is flawed. You do not want data that was successfully able to be stored to then throw an error on a select - that is a land mine waiting to happen and means you never know when a select could fail.

What I recommend is that you add a unique key to the table, and slowly start modifying your application to use this key rather than relying on any combination of meaningful data.

You can then stop caring about duplicate data, which is not really duplicate in the first place. It is entirely possible for two people with the same name to share the same address.

You will also gain performance improvements from this approach.

As an aside, I highly recommend you normalize your data, that is break up the name into FirstName and LastName (optionally MiddleName too), and break up the address field into separate fields for each component (Address1, Address2, City, State, Country, Zip, or whatever)

Update: If I understand your situation correctly (which I am not sure I do), you want to prevent duplicate combinations of name and address from ever occurring in the table (even though that is a possible occurrence in real life). This is best done by a unique constraint or index on these two fields to prevent the data from being inserted. That is, catch the error before you insert it. That will tell you the import file or your resulting app logic is bad and you can choose to take the appropriate measures then.

I still maintain that throwing the error when you query is too late in the game to do anything about it.

Making the query fail may be tricky...

This will show you if there are any duplicate records in the table:

select customer_number, customer_name, customer_address
from customers
group by customer_number, customer_name, customer_address
having count(*) > 1

If you just add a unique index for all the three fields, noone can create a duplicate record in the table.

The defacto key is Name+Address, so that's what you need to group by.

SELECT
  Customer_Name,
  Customer_Address,
  CASE WHEN Count(DISTINCT Customer_Number) > 1
    THEN 1/0 ELSE 0 END as LandMine
FROM Customers
GROUP BY Customer_Name, Customer_Address

If you want to do it from the point of view of a Customer_Number, then this is good too.

SELECT *, 
CASE WHEN Exists((
  SELECT top 1 1
  FROM Customers c2
  WHERE c1.Customer_Number != c2.Customer_Number
    AND c1.Customer_Name = c2.Customer_Name
    AND c1.Customer_Address = c2.Customer_Address
)) THEN 1/0 ELSE 0 END as LandMine
FROM Customers c1
WHERE Customer_Number = @Number

If you have dirty data, I would clean it up first.

Use this to find the duplicate customer records...

Select * From customers
Where customer_number in 
  (Select Customer_number from customers
  Group by customer_number Having count(*) > 1)

If you want it to fail you're going to need to have an index. If you don't want to have an index, then you can just create a temp table to do this all in.

CREATE TABLE #temp_customers 
    (customer_number int, 
    customer_name varchar(50), 
    customer_address varchar(50),
    PRIMARY KEY (customer_number),
     UNIQUE(customr_name, customer_address))

)

INSERT INTO #temp_customers
SELECT DISTINCT customer_number, customer_name, customer_address
FROM customers

SELECT customer_number, customer_name, customer_address
FROM #temp_customers

DROP TABLE #temp_customers

This will fail if there are issues but will keep your duplicate records from causing issues.

Let's put the data into a temp table or table variable with your distinct query

select distinct customer_number, customer_name, customer_address, 
  IDENTITY(int, 1,1) AS ID_Num
into #temp 
from unprocessed_invoices

Personally I would add an indetity to unporcessed invoices if possible as well. I never do an import without creating a staging table that has an identity column just because it is easier to delete duplicate records.

Now let's query the table to find your problem records. I assume you would want to see what is causing the problem not just fail them.

Select t1.* from #temp t1
join #temp t2 
  on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address 
where t1.customer_number <> t2.customer_number

select t1.* from #temp t1
join 
(select customer_number from #temp group by customer_number having count(*) >1) t2
  on t1.customer_number = t2.customer_number

You can use a variation on these queries to delete the problem records from #temp (depends on if you choose to keep one or delete all possible problems) and then insert from #temp to your production table. You can also porvide the problem records back to whoever is providing you data to be fixed at their end.

Vasim Sayyad

Select t1.* from #temp t1
join #temp t2 
  on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address 
where t1.customer_number <> t2.customer_number

select t1.* from #temp t1
join 
(select customer_number from #temp group by customer_number having count(*) >1) t2
  on t1.customer_number = t2.customer_number

来源：https://stackoverflow.com/questions/987893/safely-normalizing-data-via-sql-query

标签

sql

denormalization