Azure DataFactory Incremental BLOB copy

Submitted by 拈花ヽ惹草 on 2019-12-25 04:00:11

Question


I've made a pipeline to copy data from one blob storage to another. I want an incremental copy if possible, but I haven't found a way to specify it. The reason is that I want to run this on a schedule and copy only the data that is new since the last run.


Answer 1:


  1. If your blob names include a timestamp, you could follow this doc to copy partitioned data. You could use the Copy Data tool to set up the pipeline: select a tumbling window, enter {year}/{month}/{day}/fileName in the file path field, and choose the right pattern. It will help you construct the parameters.
  2. If your blob names do not include a timestamp, you could use the Get Metadata activity to check the last modified time. Please reference this post. A sketch of both ideas appears after this list.
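Here is a minimal sketch of both ideas written with the azure-storage-blob Python SDK rather than an ADF pipeline, just to make the filtering logic concrete. The container names, connection string, and the "last run" timestamp are assumptions for illustration only.

    # Sketch only: same-account, cross-container copy with azure-storage-blob v12.
    from datetime import datetime, timezone

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    src = service.get_container_client("source-container")      # assumed name
    dst = service.get_container_client("destination-container")  # assumed name


    def copy_partitioned_window(window_start: datetime) -> None:
        """Idea 1: blob names carry a timestamp, e.g. 2019/08/21/file.csv.
        Copy only the blobs under the prefix for this tumbling-window slice."""
        prefix = window_start.strftime("%Y/%m/%d/")
        for blob in src.list_blobs(name_starts_with=prefix):
            source_url = src.get_blob_client(blob.name).url
            # Server-side copy; same-account sources are authorized by the account key.
            dst.get_blob_client(blob.name).start_copy_from_url(source_url)


    def copy_modified_since(last_run: datetime) -> None:
        """Idea 2: blob names are arbitrary, so filter on the last-modified time
        (the property the Get Metadata activity inspects) instead of the name."""
        for blob in src.list_blobs():
            if blob.last_modified > last_run:
                source_url = src.get_blob_client(blob.name).url
                dst.get_blob_client(blob.name).start_copy_from_url(source_url)


    copy_modified_since(datetime(2019, 8, 21, tzinfo=timezone.utc))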

An event trigger is just one way to control when the pipeline should run. You could also use a tumbling window trigger or a schedule trigger in your scenario.




Answer 2:


I'm going to presume that by 'incremental' you mean new blobs added to a container. There is no easy way to copy changes to a specific blob.

So, this is not possible automatically when running on a schedule since 'new' is not something the scheduler can know.

Instead, you can use a Blob Created event trigger, then cache the result (the blob name) somewhere else. Then, when your schedule runs, it can read those names and copy only those blobs.

You have many options for the cache: a SQL table, another blob, etc. The sketch below assumes the "another blob" option.
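A minimal sketch of the scheduled step, assuming the Blob Created trigger has been appending new blob names (one per line) to a hypothetical pending.txt blob used as the cache. Container names and the connection string are illustrative only.

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    src = service.get_container_client("source-container")       # assumed name
    dst = service.get_container_client("destination-container")  # assumed name
    cache = service.get_blob_client("control-container", "pending.txt")  # assumed cache blob

    # Read the names cached since the last scheduled run.
    pending = cache.download_blob().readall().decode("utf-8").splitlines()

    for name in filter(None, pending):
        source_url = src.get_blob_client(name).url
        # Server-side copy of just the blobs that arrived since the last run.
        dst.get_blob_client(name).start_copy_from_url(source_url)

    # Reset the cache so the next run only sees newer blobs.
    cache.upload_blob(b"", overwrite=True)

A production version would need to handle names added to the cache while the scheduled run is in progress; the reset here is deliberately naive.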

Note: The complication here is trying to do this on a schedule. If you can adjust the requirements to merely copy every new file as it arrives, it's very, very easy, because you can just copy the blob that fired the trigger.

Another option is to use the trigger to copy the blob on creation to a temporary/staging container, then use a schedule to move those files to the ultimate destination.
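A minimal sketch of the scheduled half of that option: drain a hypothetical staging container (filled by the Blob Created trigger) into the final destination and delete the staged copies. Names are illustrative only.

    import time

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    staging = service.get_container_client("staging-container")      # assumed name
    dest = service.get_container_client("destination-container")     # assumed name

    for blob in staging.list_blobs():
        source_url = staging.get_blob_client(blob.name).url
        dst_blob = dest.get_blob_client(blob.name)
        dst_blob.start_copy_from_url(source_url)
        # Same-account copies usually complete quickly, but wait for the
        # server-side copy to finish before removing the staged source.
        while dst_blob.get_blob_properties().copy.status == "pending":
            time.sleep(1)
        staging.delete_blob(blob.name)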



Source: https://stackoverflow.com/questions/51936847/azure-datafactory-incremental-blob-copy
