First, I'll explain what I need to do, then how I think I can achieve it. My current plan seems very inefficient in theory, so my question is whether there is a b
I had to solve a similar problem once, so maybe the solution applies to your case (I do not know ColdFusion well). Why not, for each source, just delete everything in table Products that belongs to that source and replace it with the contents of Products_Temp from the same source? This assumes you can assign a unique ID to each source. The SQL code would look something like:
DELETE FROM Products WHERE source_id = x;
INSERT INTO Products (field1, field2, ..., source_id) SELECT field1, field2, ..., x FROM Products_Temp;
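To try the delete-and-replace idea without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. The columns `name` and `stock`, the helper `replace_source` and the sample rows are made up for illustration. Wrapping both statements in one transaction is my own addition, so that a failed insert does not leave the source empty.

```python
import sqlite3


def replace_source(conn: sqlite3.Connection, source_id: int) -> None:
    """Swap all rows of one source for the freshly imported ones."""
    # "with conn" commits if both statements succeed and rolls back otherwise,
    # so the source is never left half-deleted.
    with conn:
        conn.execute("DELETE FROM Products WHERE source_id = ?", (source_id,))
        conn.execute(
            "INSERT INTO Products (name, stock, source_id) "
            "SELECT name, stock, ? FROM Products_Temp",
            (source_id,),
        )


def demo_replace_source() -> list:
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: Products_Temp holds the download of a single source.
    conn.executescript(
        """
        CREATE TABLE Products (name TEXT, stock INTEGER, source_id INTEGER);
        CREATE TABLE Products_Temp (name TEXT, stock INTEGER);
        INSERT INTO Products VALUES ('old widget', 1, 1), ('other feed', 9, 2);
        INSERT INTO Products_Temp VALUES ('new widget', 5), ('gadget', 3);
        """
    )
    replace_source(conn, 1)
    return conn.execute(
        "SELECT name, stock, source_id FROM Products ORDER BY source_id, name"
    ).fetchall()
```

Calling `demo_replace_source()` replaces the rows of source 1 and leaves the row of source 2 untouched.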
Also, if the source doesn't change much, consider computing a hash of the file after downloading it and skipping the update when the hash has not changed, to save some database access.
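A rough sketch of that check in Python, assuming the downloaded feed is available as bytes and that you store the previous digest somewhere between runs (the function names and the choice of SHA-256 are mine, not part of the suggestion):

```python
import hashlib
from typing import Optional, Tuple


def feed_digest(data: bytes) -> str:
    """Hex digest of the raw downloaded feed."""
    return hashlib.sha256(data).hexdigest()


def needs_import(data: bytes, last_digest: Optional[str]) -> Tuple[bool, str]:
    """Return (changed, digest); store the digest for the next run."""
    digest = feed_digest(data)
    # last_digest is None on the very first run, so the import always happens.
    return digest != last_digest, digest
```

If `needs_import` returns `False` as its first value, the whole database update for that source can be skipped.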
Both responses have possibilities. Just to expand on your options a little:
If MySQL supports some sort of hashing on a per-row basis, you could use a variation of comodoro's suggestion to avoid hard deletes.
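MySQL does have built-in hash functions such as MD5() and SHA2(). If you would rather compute the row hash in the import script, a minimal Python sketch could look like this. The function name `row_hash`, the field order and the separator are assumptions; whatever scheme you pick has to be applied identically to both tables.

```python
import hashlib


def row_hash(*fields) -> str:
    """Hash the compared columns of one row into a single hex string."""
    parts = []
    for value in fields:
        # Tag each part so that None (SQL NULL) and '' hash differently.
        parts.append("N" if value is None else "V" + str(value))
    # The unit separator keeps ('ab', 'c') distinct from ('a', 'bc').
    joined = "\x1f".join(parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

The result would be stored in a column like TheRowHash. Values are compared by their string form, so the integer 5 and the string '5' produce the same hash.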
Identify Changed
To identify changes, do an inner join on the primary key and check the hash values. If they are different, the product was changed and should be updated:
UPDATE Products p INNER JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
SET p.ProductName = tmp.ProductName
, p.Stock = tmp.Stock
, ...
, p.DateLastChanged = now()
, p.IsDiscontinued = 0
WHERE tmp.TheRowHash <> p.TheRowHash
Identify Deleted
Use a simple outer join to identify records that do not exist in the temp table, and flag them as "deleted":
UPDATE Products p LEFT JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
SET p.DateLastChanged = now()
, p.IsDiscontinued = 1
WHERE tmp.ProductID IS NULL
Identify New
Finally, use a similar outer join to insert any "new" products.
INSERT INTO Products ( ProductName, Stock, DateLastChanged, IsDiscontinued, ... )
SELECT tmp.ProductName, tmp.Stock, now() AS DateLastChanged, 0 AS IsDiscontinued, ...
FROM Products_Temp tmp LEFT JOIN Products p ON tmp.ProductID = p.ProductID
WHERE p.ProductID IS NULL
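To see the three steps working together, here is a small runnable sketch using Python's sqlite3 module. SQLite is used only so the example runs without a server. It does not accept MySQL's UPDATE ... JOIN syntax, so the joins are rewritten as correlated subqueries. The sample rows and hash values are invented, and I also copy TheRowHash during the update, which is an assumption about what the elided columns contain.

```python
import sqlite3

SYNC_STEPS = [
    # 1. Changed: same key, different hash.
    """
    UPDATE Products
    SET ProductName = (SELECT t.ProductName FROM Products_Temp t
                       WHERE t.ProductID = Products.ProductID),
        Stock = (SELECT t.Stock FROM Products_Temp t
                 WHERE t.ProductID = Products.ProductID),
        TheRowHash = (SELECT t.TheRowHash FROM Products_Temp t
                      WHERE t.ProductID = Products.ProductID),
        DateLastChanged = CURRENT_TIMESTAMP,
        IsDiscontinued = 0
    WHERE EXISTS (SELECT 1 FROM Products_Temp t
                  WHERE t.ProductID = Products.ProductID
                    AND t.TheRowHash <> Products.TheRowHash)
    """,
    # 2. Deleted: key no longer present in the temp table.
    """
    UPDATE Products
    SET DateLastChanged = CURRENT_TIMESTAMP, IsDiscontinued = 1
    WHERE NOT EXISTS (SELECT 1 FROM Products_Temp t
                      WHERE t.ProductID = Products.ProductID)
    """,
    # 3. New: key present only in the temp table.
    """
    INSERT INTO Products (ProductID, ProductName, Stock, TheRowHash,
                          DateLastChanged, IsDiscontinued)
    SELECT t.ProductID, t.ProductName, t.Stock, t.TheRowHash,
           CURRENT_TIMESTAMP, 0
    FROM Products_Temp t LEFT JOIN Products p ON t.ProductID = p.ProductID
    WHERE p.ProductID IS NULL
    """,
]


def sync(conn: sqlite3.Connection) -> None:
    with conn:  # one transaction for all three steps
        for statement in SYNC_STEPS:
            conn.execute(statement)


def demo_hash_sync() -> list:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT,
            Stock INTEGER, TheRowHash TEXT, DateLastChanged TEXT,
            IsDiscontinued INTEGER);
        CREATE TABLE Products_Temp (ProductID INTEGER, ProductName TEXT,
            Stock INTEGER, TheRowHash TEXT);
        INSERT INTO Products VALUES
            (1, 'Widget', 5, 'h1', '2000-01-01 00:00:00', 0),
            (2, 'Gadget', 3, 'h2', '2000-01-01 00:00:00', 0),
            (3, 'Gizmo', 7, 'h3', '2000-01-01 00:00:00', 0);
        INSERT INTO Products_Temp VALUES
            (1, 'Widget', 5, 'h1'), (2, 'Gadget', 4, 'h2b'),
            (4, 'Doohickey', 1, 'h4');
        """
    )
    sync(conn)
    # Last column is 1 when the row still carries its original timestamp.
    return conn.execute(
        "SELECT ProductID, Stock, IsDiscontinued, "
        "DateLastChanged = '2000-01-01 00:00:00' "
        "FROM Products ORDER BY ProductID"
    ).fetchall()
```

In this sample, product 1 is left untouched, product 2 gets its new stock, product 3 is flagged as discontinued, and product 4 is inserted.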
If per-row hashing is not feasible, an alternative approach is a variation of Sharondio's suggestion.
Add a "status" column to the temp table and flag all imported records as "new", "changed" or "unchanged" through a series of joins. (The default should be "changed").
Identify UN-Changed
First use an inner join, on all fields, to identify products that have NOT changed. (Note: if your table contains any nullable fields, remember to use something like coalesce. Otherwise, the results may be skewed, because null values are not equal to anything.)
UPDATE Products_Temp tmp INNER JOIN Products p ON tmp.ProductID = p.ProductID
SET tmp.Status = 'Unchanged'
WHERE p.ProductName = tmp.ProductName
AND p.Stock = tmp.Stock
...
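The note about nullable fields is easy to verify. The sketch below uses Python's sqlite3 module and a hypothetical nullable `Description` column (not part of the original tables) to count "unchanged" rows with and without coalesce. NULL comparison works the same way in MySQL.

```python
import sqlite3


def count_unchanged(null_safe: bool) -> int:
    conn = sqlite3.connect(":memory:")
    # Both tables hold identical data; product 2 has a NULL description.
    conn.executescript(
        """
        CREATE TABLE Products (ProductID INTEGER, Description TEXT);
        CREATE TABLE Products_Temp (ProductID INTEGER, Description TEXT);
        INSERT INTO Products VALUES (1, 'blue'), (2, NULL);
        INSERT INTO Products_Temp VALUES (1, 'blue'), (2, NULL);
        """
    )
    if null_safe:
        condition = "COALESCE(p.Description, '') = COALESCE(tmp.Description, '')"
    else:
        condition = "p.Description = tmp.Description"
    return conn.execute(
        "SELECT COUNT(*) FROM Products_Temp tmp "
        "INNER JOIN Products p ON tmp.ProductID = p.ProductID "
        "WHERE " + condition
    ).fetchone()[0]
```

With plain equality only product 1 counts as unchanged, because `NULL = NULL` is not true. With coalesce both rows match. Be aware that `COALESCE(x, '')` treats NULL and the empty string as the same value. MySQL also has a NULL-safe equality operator, `<=>`, which avoids that side effect.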
Identify New
Like before, use an outer join to identify "new" records.
UPDATE Products_Temp tmp LEFT JOIN Products p ON tmp.ProductID = p.ProductID
SET tmp.Status = 'New'
WHERE p.ProductID IS NULL
By process of elimination, all other records in the temp table are "changed". Once you have calculated the statuses, you can update the Products table:
/* update changed products */
UPDATE Products p INNER JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
SET p.ProductName = tmp.ProductName
, p.Stock = tmp.Stock
, ...
, p.DateLastChanged = now()
, p.IsDiscontinued = 0
WHERE tmp.Status = 'Changed'
/* insert new products */
INSERT INTO Products ( ProductName, Stock, DateLastChanged, IsDiscontinued, ... )
SELECT tmp.ProductName, tmp.Stock, now() AS DateLastChanged, 0 AS IsDiscontinued, ...
FROM Products_Temp tmp
WHERE tmp.Status = 'New'
/* flag deleted records */
UPDATE Products p LEFT JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
SET p.DateLastChanged = now()
, p.IsDiscontinued = 1
WHERE tmp.ProductID IS NULL
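The status-flagging part of this approach can be tried out with a short Python/sqlite3 sketch. As before, SQLite stands in for MySQL, so the joins are written as EXISTS subqueries. The sample rows are invented, and only ProductName and Stock are compared.

```python
import sqlite3


def classify(conn: sqlite3.Connection) -> None:
    """Flag temp rows as 'Unchanged' or 'New'; the rest keep 'Changed'."""
    with conn:
        # Unchanged: key matches and every compared field is equal.
        conn.execute(
            """
            UPDATE Products_Temp SET Status = 'Unchanged'
            WHERE EXISTS (SELECT 1 FROM Products p
                          WHERE p.ProductID = Products_Temp.ProductID
                            AND p.ProductName = Products_Temp.ProductName
                            AND p.Stock = Products_Temp.Stock)
            """
        )
        # New: no existing product has this key.
        conn.execute(
            """
            UPDATE Products_Temp SET Status = 'New'
            WHERE NOT EXISTS (SELECT 1 FROM Products p
                              WHERE p.ProductID = Products_Temp.ProductID)
            """
        )


def demo_classify() -> dict:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Products (ProductID INTEGER PRIMARY KEY,
            ProductName TEXT, Stock INTEGER);
        CREATE TABLE Products_Temp (ProductID INTEGER, ProductName TEXT,
            Stock INTEGER, Status TEXT DEFAULT 'Changed');
        INSERT INTO Products VALUES (1, 'Widget', 5), (2, 'Gadget', 3);
        INSERT INTO Products_Temp (ProductID, ProductName, Stock) VALUES
            (1, 'Widget', 5), (2, 'Gadget', 4), (4, 'Doohickey', 1);
        """
    )
    classify(conn)
    rows = conn.execute("SELECT ProductID, Status FROM Products_Temp").fetchall()
    return dict(rows)
```

Here product 1 ends up 'Unchanged', product 4 'New', and product 2 keeps the default 'Changed' because its stock differs.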
To find the changes, I'd look at joins based on the fields you want to match on. This can be slow, depending on the number of fields and whether or not they're indexed, but I'd still expect it to be faster than loops. Something along the lines of:
SELECT product_id
FROM Products
WHERE product_id NOT IN (
SELECT T.product_id
FROM Products_Temp T
INNER JOIN Products P
ON (
P.field1 = T.field1
AND P.field2 = T.field2
...
)
)
To find the missing products, i.e. the non-matches:
SELECT P.product_id
FROM Products P
LEFT OUTER JOIN Products_Temp T
ON (P.field1 = T.field1
AND P.field2 = T.field2
...)
WHERE T.product_id IS NULL