Question
I've got multiple identical databases (distributed on several servers) and need to gather them to one single point to do data mining, etc.
The idea is to take Table1, Table2, ..., TableN from each database, merge them, and put the result into one single big database.
To be able to write queries and know which database each row came from, we will add a single DatabaseID column to the target table.
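For a source table Table1(Col1, Col2), the target could then look something like this (column names and types here are purely illustrative):

-- Target table: same columns as the source, plus one extra column
-- recording which source database each row came from
CREATE TABLE dbo.Table1_Merged (
    DatabaseID varchar(128) NOT NULL,  -- source database identifier
    Col1       int          NOT NULL,
    Col2       varchar(50)  NULL
)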
Editing the source tables is not an option; they belong to some proprietary software.
We've got ~40 servers, ~170 databases and need to copy ~40 tables.
Now, how should we implement this given that it should be:
- Easy to set up
- Easy to maintain
- Preferably easy to adjust if the database schema changes
- Reliable, logging/alarm if something fails
- Not too hard to add more tables to copy
We've looked into SSIS, but it seemed that we would have to add each table as a source/transformation/destination. I'm guessing it would also be quite tied to the database schema. Right?
Another option would be to use SQL Server Replication, but I don't see how to add the DatabaseID column to each table. It seems it's only possible to copy data, not modify it.
Maybe we could copy all the data into separate databases, and then run a local job on the target server to merge the tables?
It also seems like a lot of work if we need to add more tables to copy, as we'd have to redistribute new publications for each database (manual work?).
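For what it's worth, the local merge job itself could be a plain batch along these lines, assuming replication has delivered each source into its own local staging database (all names here are hypothetical):

-- Rebuild the merged table from the local staging copies
TRUNCATE TABLE dbo.Table1_Merged

INSERT INTO dbo.Table1_Merged (DatabaseID, Col1, Col2)
SELECT 'Staging_Server01_Db1', Col1, Col2 FROM Staging_Server01_Db1.dbo.Table1
UNION ALL
SELECT 'Staging_Server01_Db2', Col1, Col2 FROM Staging_Server01_Db2.dbo.Table1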
Last option (?) is to write a custom application for our needs. A bigger time investment, but it'd at least do precisely what we'd like.
To make it worse... we're using Microsoft SQL Server 2000. We will upgrade to SQL Server 2008 R2 within 6 months, but we'd like the project to be usable sooner.
Let me know what you guys think!
UPDATE 20110721
We ended up with an F# program that opens a connection to the SQL Server where we want the aggregated databases. From there we query the 40 linked SQL Servers to fetch all rows (but not all columns) from some tables, and add an extra column to each table saying which DatabaseID the row came from. The configuration of which servers to fetch from, which tables, and which columns is a combination of text-file configuration and hard-coded values (heh :D). It's not super fast (sequential fetching so far) but it's absolutely manageable, and the data processing we do afterwards takes far longer.
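The query sent to each linked server is essentially of this shape (server, database, and column names here are made up):

-- One such query per linked server and table; DatabaseID is added on the way in
SELECT 'Server17_Db3' AS DatabaseID, Col1, Col2  -- only the columns we need
FROM Server17.Db3.dbo.Table1
-- a WHERE clause on a modified-date column could later limit this to new/updated rows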
Future improvements could be to:
- improve error handling if it turns out to be a problem (if a server isn't online, etc).
- implement parallel fetching, to reduce the total amount of time to finish fetching.
- figure out if it's enough to fetch only some of the rows, like only what's been added/updated.
All in all it turned out to be quite simple, with no dependencies on other products, and it works well in practice.
Answer 1:
Nothing fancy, but couldn't you do something like:
-- Clear the target, then reload it from every linked server in one pass
TRUNCATE TABLE dbo.Merged

INSERT INTO dbo.Merged
SELECT [DatabaseID] = 'Database1', * FROM ServerA.Database1.dbo.Table1
UNION ALL SELECT [DatabaseID] = 'Database2', * FROM ServerB.Database2.dbo.Table1
...
UNION ALL SELECT [DatabaseID] = 'DatabaseX', * FROM ServerX.DatabaseX.dbo.Table1
Advantages
- Easy to set up
- Easy to maintain
- Easy to adjust
- Easy to add more tables
Disadvantages
- Performance
- No reliable logging or alerting if something fails
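One way to soften the maintenance burden would be to generate the statement from a small inventory table instead of editing it by hand; a minimal sketch, assuming hypothetical table and column names (SQL Server 2000 compatible):

-- Hypothetical inventory of linked servers and the databases behind them
CREATE TABLE dbo.SourceInventory (
    LinkedServer sysname NOT NULL,
    DatabaseName sysname NOT NULL
)

DECLARE @sql nvarchar(4000)
SET @sql = N''
-- Build one UNION ALL branch per inventory row
SELECT @sql = @sql
    + CASE WHEN @sql = N'' THEN N'' ELSE N' UNION ALL ' END
    + N'SELECT ''' + DatabaseName + N''' AS DatabaseID, * FROM '
    + LinkedServer + N'.' + DatabaseName + N'.dbo.Table1'
FROM dbo.SourceInventory

SET @sql = N'INSERT INTO dbo.Merged ' + @sql
EXEC (@sql)

Adding a server or database then becomes a single INSERT into the inventory table.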
Answer 2:
We had a similar requirement and took a different approach. First we created a central database to collect the data. Then we created an inventory table to store the list of target servers/databases. Then we wrote a small VB.NET-based CLR procedure which takes the text of a SQL query, the target SQL instance name, and the target table that will store the data (this eliminates the need to set up a linked server whenever a new target is added). It also adds two additional columns to the result set: the target server name and the timestamp when the data was captured.
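The inventory table can be something along these lines (names are illustrative, not the actual schema):

-- List of target instances/databases to collect from
CREATE TABLE dbo.TargetInventory (
    TargetID     int IDENTITY(1,1) PRIMARY KEY,
    InstanceName sysname NOT NULL,        -- target SQL Server instance
    DatabaseName sysname NOT NULL,
    IsActive     bit NOT NULL DEFAULT 1   -- remove targets without deleting rows
)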
Then we set up a Service Broker queue/service and pushed the list of target servers to interrogate.
The above CLR procedure is wrapped in another procedure which dequeues a message and executes the SQL against the target server it names. The wrapper procedure is then configured as the activation procedure for the queue.
With this we are able to achieve some parallelism in capturing the data.
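A minimal sketch of the queue/service side, with illustrative names (MAX_QUEUE_READERS is what buys the parallelism):

-- One-time setup: message type, contract, queue with activation, service
CREATE MESSAGE TYPE TargetRequest VALIDATION = WELL_FORMED_XML
CREATE CONTRACT TargetContract (TargetRequest SENT BY INITIATOR)

CREATE QUEUE dbo.TargetQueue
    WITH STATUS = ON,
    ACTIVATION (
        PROCEDURE_NAME    = dbo.usp_CaptureTarget,  -- the wrapper procedure
        MAX_QUEUE_READERS = 4,                      -- degree of parallelism
        EXECUTE AS SELF )

CREATE SERVICE TargetService ON QUEUE dbo.TargetQueue (TargetContract)

-- Push one message per target server/database:
DECLARE @h uniqueidentifier
BEGIN DIALOG CONVERSATION @h
    FROM SERVICE TargetService TO SERVICE 'TargetService'
    ON CONTRACT TargetContract WITH ENCRYPTION = OFF
SEND ON CONVERSATION @h MESSAGE TYPE TargetRequest
    (N'<target instance="ServerA" database="Db1" />')

-- Pause collection for maintenance, then resume:
-- ALTER QUEUE dbo.TargetQueue WITH STATUS = OFF
-- ALTER QUEUE dbo.TargetQueue WITH STATUS = ON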
Advantages:
- Easy to set up
- Easy to manage (add/remove targets)
- Same framework works for multiple queries
- Logging tables to check for failed queries.
- Works independently for each target, so if one target fails to respond, the others still continue.
- The workflow can be paused gracefully by disabling the queue (for maintenance on the central server) and collection resumed by re-enabling it.
Disadvantages:
- Requires a good understanding of Service Broker.
- Poison messages must be handled properly (see the sketch below).
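For that last point, the key is that the activated procedure must not keep rolling back on failure, since five consecutive rollbacks disable the queue. A skeletal version with illustrative names (a real one would loop over the queue):

CREATE PROCEDURE dbo.usp_CaptureTarget
AS
BEGIN
    DECLARE @h uniqueidentifier, @msg xml

    BEGIN TRANSACTION

    WAITFOR (
        RECEIVE TOP (1)
            @h   = conversation_handle,
            @msg = CAST(message_body AS xml)
        FROM dbo.TargetQueue
    ), TIMEOUT 1000

    IF @h IS NULL
    BEGIN
        COMMIT
        RETURN
    END

    BEGIN TRY
        -- parse @msg and call the CLR capture procedure here (hypothetical name)
        EXEC dbo.usp_CaptureFromTarget @msg
    END TRY
    BEGIN CATCH
        -- Log and move on rather than roll back: five consecutive rollbacks
        -- would disable the queue (the poison-message trap).
        INSERT INTO dbo.CaptureLog (LoggedAt, Detail)
        VALUES (GETDATE(), ERROR_MESSAGE())
    END CATCH

    END CONVERSATION @h
    COMMIT
END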
Please let me know if it helps.
Source: https://stackoverflow.com/questions/4669305/copy-multiple-tables-into-one-table-from-multiple-databases