Staging in ETL: Best Practices?

Posted by 此生再无相见时 on 2020-01-17 04:16:06

Question


Currently, the architecture I work with draws on a few data sources, one of which is staged locally because it's hosted in the cloud. The others are hosted locally anyway, so my ETL reads them directly from the source. I don't really see the point in creating a stage for the other sources.

1) Is there a distinct benefit to duplicating the locally hosted source into a local stage?

2) Is it a better idea to host the stage on a separate machine or the same one as the Warehouse?

3) If I'm trying to reduce my ETL time, what's a good way to do so? I was considering partitioning my data so that the important information is pulled more frequently than the "archived data". Is this a good approach, and what are my alternatives?


Answer 1:


@omgitsdev There are a few concepts I would like to clarify.

Your files can be hosted anywhere, locally or in the cloud. The files are loaded into a temporary table before being loaded into your data warehouse. This process is called staging.

Conceptually, you can have your staging area anywhere; however, to reduce connectivity issues, we create a separate schema in the same database and stage the data there. This ensures that your performance is not hampered by connectivity issues.
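To make the staging idea concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. It is only an illustration under stated assumptions: the table names `stg_sales` and `dw_sales`, the columns, and the validation rule are all hypothetical, and the `stg_`/`dw_` prefixes stand in for the separate schemas you would create in a full database server.

```python
import sqlite3


def load_via_staging(conn, rows):
    """Stage raw rows, then copy only the valid ones into the warehouse table.

    Returns the total number of rows in the (hypothetical) warehouse table.
    """
    cur = conn.cursor()
    # Staging table: permissive types, holds the raw extract as-is.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS stg_sales (sale_date TEXT, amount TEXT)"
    )
    # Warehouse table: stricter types and constraints.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS dw_sales "
        "(sale_date TEXT NOT NULL, amount REAL NOT NULL)"
    )
    # Staging is temporary, so clear it before each load.
    cur.execute("DELETE FROM stg_sales")
    cur.executemany("INSERT INTO stg_sales VALUES (?, ?)", rows)
    # Both tables live in the same database, so this step needs no network hop.
    cur.execute(
        """
        INSERT INTO dw_sales (sale_date, amount)
        SELECT sale_date, CAST(amount AS REAL)
        FROM stg_sales
        WHERE sale_date IS NOT NULL
          AND amount IS NOT NULL
          AND amount <> ''
        """
    )
    conn.commit()
    return cur.execute("SELECT COUNT(*) FROM dw_sales").fetchone()[0]
```

Called with an in-memory connection (`sqlite3.connect(":memory:")`) and a list of `(sale_date, amount)` tuples, the function keeps the bad rows in staging and moves only the clean ones on. Because the cleanup runs against the staged copy, the source system is read only once per load.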

You generally partition your fact table by the column that holds the date; this is easier, and the most recent partitions then hold the latest data.

Depending on the volume, you make the partitions monthly, quarterly, or yearly; there are situations where we also create daily or hourly partitions.
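As a rough illustration of how the granularity choice maps a date onto a partition, here is a small Python sketch. The function name `partition_key` and the label formats (`202001`, `2020Q1`, and so on) are assumptions made for this example; a real database would define partitions through its own DDL.

```python
from datetime import date


def partition_key(d: date, granularity: str = "monthly") -> str:
    """Return a hypothetical partition label for a date at the given granularity."""
    if granularity == "yearly":
        return f"{d.year}"
    if granularity == "quarterly":
        # Months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4.
        return f"{d.year}Q{(d.month - 1) // 3 + 1}"
    if granularity == "monthly":
        return f"{d.year}{d.month:02d}"
    if granularity == "daily":
        return d.strftime("%Y%m%d")
    raise ValueError(f"unsupported granularity: {granularity}")
```

With labels like these, an ETL job that only needs recent data can restrict itself to the latest one or two partitions and leave the older ones untouched.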

You can also improve performance by keeping the staging tables on a separate disk from the data warehouse tables.



Source: https://stackoverflow.com/questions/23997776/staging-in-etl-best-practices
