Keep two databases synchronized with timestamp / rowversion

Question

I have a primary database containing table A and a secondary database containing another copy of A. Each time my application starts, it reads every row of table A in the primary database and updates the corresponding rows of A in the secondary database.

This ugly behaviour is needed to support a legacy database, but running it on every start is becoming very CPU-expensive. I have found out that a timestamp column (which Microsoft also calls rowversion) can be used to tell which rows have been updated.

My application would therefore need to store the timestamp of the most recently modified or inserted row, and on later restarts it would query the primary database only for rows modified or inserted since then.

This would speed things up considerably, but how would I deal with deleted rows? Thank you.

EDIT: I just noticed that I can only access the primary database in read-only mode. I therefore cannot add a timestamp column to the original database, and I cannot create triggers of any sort.

Is there some way I can quickly see what changed in the primary database without modifying it?

Answer 1:

You'd need some way to flag deleted rows for processing on the slave side. This might be a good case to use a trigger whereby when a row is deleted, you store either the whole row or maybe just the (table, id) tuple in another table - call that your new deleted_rows table.

Then when your app starts, it reads the deleted_rows table populated by your trigger and applies those changes to the slave db. Be sure to clear out deleted_rows when you're done so you don't bother trying to reprocess those records later.
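A minimal sketch of that trigger approach, using Python's built-in sqlite3 module with an in-memory database so it runs anywhere. The table `a`, the column names, the trigger name `a_after_delete` and the helper `drain_deleted_rows` are all made up for illustration, and trigger syntax differs between engines. It also assumes you are allowed to create objects in the primary database.

```python
import sqlite3

# Stand-in for the primary database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE deleted_rows (table_name TEXT NOT NULL, row_id INTEGER NOT NULL);

-- Log the (table, id) pair of every row deleted from a.
CREATE TRIGGER a_after_delete AFTER DELETE ON a
BEGIN
    INSERT INTO deleted_rows (table_name, row_id) VALUES ('a', OLD.id);
END;

INSERT INTO a (id, payload) VALUES (1, 'one'), (2, 'two'), (3, 'three');
""")


def drain_deleted_rows(primary):
    """Return the logged (table_name, row_id) pairs, then clear the log."""
    with primary:  # commits on success, rolls back on error
        rows = primary.execute(
            "SELECT table_name, row_id FROM deleted_rows ORDER BY row_id"
        ).fetchall()
        primary.execute("DELETE FROM deleted_rows")
    return rows


# Simulate a delete on the primary, then read the log at application start.
conn.execute("DELETE FROM a WHERE id = 2")
conn.commit()
pending = drain_deleted_rows(conn)
```

Each pair in `pending` would then be turned into a DELETE against the slave db. Clearing the log in the same step as reading it is what keeps those records from being reprocessed on the next start.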

Answer 2:

The feature you're building is supported "out of the box" by many database engines - it's called replication.

For H2, it's not an out-of-the-box feature - but there's an open source tool called SymmetricDS which appears to offer it; according to its FAQ, it works with H2.

I'd consider using this, rather than your own replication scheme - it is likely to be faster, and more robust, than anything you might write yourself, unless you dedicate a LOT of time to it.

Answer 3:

(1) Assuming there is a primary key on table A, keep a tracking table B that records only those primary keys. When the application starts up, check for keys in B that are no longer in A to get the deleted rows. (The reverse check, keys in A that are not in B, will get you the new/inserted rows.)
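As a rough sketch of (1), again with Python's sqlite3 and in-memory databases standing in for the real ones: the table names `a` and `b`, the `id` column and the function `diff_keys` are assumptions for illustration. Note that it only reads from the primary database; the tracking table lives on the secondary side.

```python
import sqlite3


def diff_keys(primary, secondary):
    """Compare the keys of A (primary db) with tracking table B (secondary db)."""
    current = {row[0] for row in primary.execute("SELECT id FROM a")}
    tracked = {row[0] for row in secondary.execute("SELECT id FROM b")}
    inserted = current - tracked  # in A but not yet in B
    deleted = tracked - current   # in B but no longer in A
    return inserted, deleted


# Primary database: the current state of table A.
primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, payload TEXT)")
primary.executemany(
    "INSERT INTO a VALUES (?, ?)", [(1, "one"), (3, "three"), (4, "four")]
)

# Secondary database: keys recorded in B at the previous start.
secondary = sqlite3.connect(":memory:")
secondary.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
secondary.executemany("INSERT INTO b VALUES (?)", [(1,), (2,), (3,)])

inserted, deleted = diff_keys(primary, secondary)
```

Here key 2 comes back as deleted and key 4 as inserted. After applying those changes, B would be updated to match the current key set.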

(2) Row version (combined with the above) is indeed ideal for what you want. Failing that, some form of checksum might be used. MS SQL Server has the CHECKSUM() function, which can be used to produce a hash value based on the contents of the entire row of data. (While hash values cannot be guaranteed to be unique, they should suffice, particularly here since you'll be checking both the hash value and the primary key value, where the primary key will be used in the hash calculation.) Upon application startup, calculate the hash value for all existing rows in table A, and check them against the tracking table created above:

If primary key in new set not found in B, it's a new row, insert primary key and hash value
If primary key in B not found in new set, it's a deleted row, delete
If primary key found in B and in new set and hash values differ, row has been updated, process accordingly
If primary key found in B and in new set and hash values match, row has not been updated

Sadly, I suspect implementing the above might not save you that much time, since table A will still require a full table scan.
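The four rules above can be sketched in a few lines of Python. This is only an illustration: it swaps SQL Server's CHECKSUM() for SHA-256 from hashlib, and the names `row_hash` and `classify`, as well as the sample rows, are made up. It assumes each row is a tuple whose first element is the primary key, and that the hashes from the previous run are loaded from B into a dict.

```python
import hashlib


def row_hash(row):
    """Hash every column of a row, primary key included."""
    encoded = "\x1f".join(repr(value) for value in row).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def classify(current_rows, tracked):
    """Split keys into inserted, deleted and updated.

    current_rows: iterable of row tuples read from table A.
    tracked: dict mapping primary key -> hash stored in table B.
    Also returns the fresh key -> hash mapping to store back into B.
    """
    current = {row[0]: row_hash(row) for row in current_rows}
    inserted = sorted(k for k in current if k not in tracked)
    deleted = sorted(k for k in tracked if k not in current)
    updated = sorted(
        k for k in current if k in tracked and current[k] != tracked[k]
    )
    return inserted, deleted, updated, current


# State recorded at the previous start.
old_rows = [(1, "one"), (2, "two"), (3, "three")]
tracked = {row[0]: row_hash(row) for row in old_rows}

# State found at this start: 2 was deleted, 3 changed, 4 is new.
new_rows = [(1, "one"), (3, "THREE"), (4, "four")]
inserted, deleted, updated, new_tracked = classify(new_rows, tracked)
```

Keys that fall into none of the three lists are the unchanged rows. As noted above, every row of A still has to be read to compute the hashes, so this saves writes to the secondary database rather than reads from the primary one.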

Source: https://stackoverflow.com/questions/15433632/keep-two-databases-synchronized-with-timestamp-rowversion

Tags

sql

timestamp

data-synchronization