Question
Currently, the architecture I work with pulls from a few data sources, one of which is staged locally because it's hosted in the cloud. The others are hosted locally anyway, so my ETL reads directly from the source. I don't really see the point in creating a stage for the other sources.
1) Is there a distinct benefit to duplicating the locally hosted source into a local stage?
2) Is it a better idea to host the stage on a separate machine or the same one as the Warehouse?
3) If I'm trying to reduce my ETL time, what's a good way to do so? I was considering partitioning my data so that the important information is pulled more frequently than the "archived data". Is this a good approach, and what are my alternatives?
Answer 1:
@omgitsdev There are a few concepts I would like to clarify.
Your files can be hosted anywhere - locally or in the cloud. The files are first loaded into a temporary table and then loaded from there into your Data Warehouse. This process is called staging.
Conceptually you can have your staging area anywhere; however, to reduce connectivity issues, we create a separate schema in the same database and stage the data there. This ensures that your performance is not hampered by network problems between the staging area and the warehouse.
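The stage-then-load flow above can be sketched in Python with the standard-library `sqlite3` module. This is a minimal illustration, not a production pipeline: the table names (`stg_orders`, `dw_orders`) and the sample rows are hypothetical, and an in-memory database stands in for the warehouse.

```python
import sqlite3

# Illustrative sketch: raw source rows land in a staging table first,
# then only cleaned, de-duplicated rows are merged into the warehouse table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Staging table mirrors the raw source (everything as text);
# the warehouse table enforces types and a primary key.
cur.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT)")
cur.execute("CREATE TABLE dw_orders (order_id INTEGER PRIMARY KEY, amount REAL)")

# 1) Extract: bulk-load the raw rows exactly as they arrive,
#    including a duplicate and a malformed id.
raw = [("1", "10.50"), ("2", "20.00"), ("2", "20.00"), ("x", "bad")]
cur.executemany("INSERT INTO stg_orders VALUES (?, ?)", raw)

# 2) Transform/load: cast, filter out bad ids, and de-duplicate,
#    so only valid rows reach the warehouse.
cur.execute("""
    INSERT OR IGNORE INTO dw_orders (order_id, amount)
    SELECT DISTINCT CAST(order_id AS INTEGER), CAST(amount AS REAL)
    FROM stg_orders
    WHERE order_id GLOB '[0-9]*'
""")
conn.commit()

loaded = cur.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]
print(loaded)  # 2
```

Because the staging table lives in the same database, the transform step is a single local SQL statement with no cross-network round trips, which is the point of staging close to the warehouse.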
You generally partition your fact table by the column that holds the date; this is the easiest approach, and the most recent partitions hold the latest data.
Based on the volume, you make it a monthly, quarterly, or yearly partition; there are also situations where we create daily or hourly partitions.
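Deriving a date-based partition key at the granularities mentioned above can be sketched as follows. The function name and the returned key formats are illustrative assumptions, not a specific database's partitioning syntax.

```python
from datetime import date

def partition_key(d: date, granularity: str = "monthly") -> str:
    """Map a fact row's date to a partition key at the chosen granularity."""
    if granularity == "daily":
        return d.isoformat()                          # e.g. "2024-05-17"
    if granularity == "monthly":
        return f"{d.year}-{d.month:02d}"              # e.g. "2024-05"
    if granularity == "quarterly":
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"  # e.g. "2024-Q2"
    if granularity == "yearly":
        return str(d.year)                            # e.g. "2024"
    raise ValueError(f"unknown granularity: {granularity}")

print(partition_key(date(2024, 5, 17), "quarterly"))  # 2024-Q2
```

With keys like these, "important" recent data lives in the newest partitions, so an incremental ETL can pull only those partitions frequently and touch the archived ones rarely - which is the questioner's proposed approach.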
Your performance can also be improved by keeping the staging tables on a separate disk from the data warehouse tables, so the load does not compete with warehouse queries for I/O.
Source: https://stackoverflow.com/questions/23997776/staging-in-etl-best-practices