Is it better to have a surrogate key or NK + effective_time in dimension tables in Apache Hive?

Submitted by 亡梦爱人 on 2019-12-06 09:14:36

(Note on terminology: a combination of natural keys is called a "composite key", not a surrogate key, and it is still a "natural key". A surrogate key (aka synthetic key) is a sequential integer that has no business meaning.)

Short answer: since your dimension is SCD2, definitely use surrogate/synthetic keys. Handling SCD with natural/composite keys is a pain.

Longer answer: surrogate key (SK) vs natural key (NK) design is an ongoing debate. Each has pros and cons. My approach is to always use surrogate keys in a data warehouse (DW). It means some extra ETL work, but that's an acceptable cost because surrogate keys have some important advantages:

  1. SCD handling is much easier. If you have SCDs, using natural keys is rather cumbersome and ugly. Synthetic keys don't have the problem;

  2. System-wide consistency: because of SCD, it's highly likely that you will have to use SKs in your Data Warehouse at least in some tables. It makes sense then to consistently use them in all tables. Mixing SK and NK designs is ugly;

  3. Composite NKs can often be large and complex alpha-numeric strings. It means that they might substantially increase table sizes, and joins might be slower. SK is a simple integer, with predictable size and consistent join speed;

  4. NKs can be a source of bugs and instability in a DW. For example, some source databases re-use their natural keys, and as a result their meaning might change over time. In a DW that relies on NKs, that's a potential disaster. Also, NKs might come from a wide variety of sources and lead to integration conflicts.

There are other considerations, but in my experience, systematically using surrogate keys makes DW design more reliable and efficient.
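
To make the SCD point above concrete, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for Hive. The table names (`dim_city`, `fact_sales`), the `sales_region` attribute, and all sample values are made up for illustration. It compares the join you write when the fact table carries a surrogate key with the join you need when it only carries the natural key plus a date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_city (
    city_sk        INTEGER PRIMARY KEY,  -- surrogate key with no business meaning
    country        TEXT,
    state          TEXT,
    city           TEXT,
    sales_region   TEXT,                 -- hypothetical attribute tracked as SCD2
    effective_date TEXT,
    end_date       TEXT                  -- exclusive end of validity
);
CREATE TABLE fact_sales (
    sale_date TEXT,
    city_sk   INTEGER,                   -- resolved once during ETL
    country   TEXT,
    state     TEXT,
    city      TEXT,                      -- natural key kept only for the comparison
    amount    REAL
);
INSERT INTO dim_city VALUES
    (1, 'usa', 'virginia', 'richmond', 'east',         '2011-01-01', '2012-01-01'),
    (2, 'usa', 'virginia', 'richmond', 'mid-atlantic', '2012-01-01', '9999-12-31');
INSERT INTO fact_sales VALUES
    ('2011-06-15', 1, 'usa', 'virginia', 'richmond', 100.0),
    ('2012-03-01', 2, 'usa', 'virginia', 'richmond', 250.0);
""")

# With a surrogate key, the join is a single integer equality.
sk_join = """
SELECT f.sale_date, d.sales_region
FROM fact_sales f
JOIN dim_city d ON d.city_sk = f.city_sk
ORDER BY f.sale_date
"""

# With natural keys, every query must repeat the composite match
# plus the validity-range condition.
nk_join = """
SELECT f.sale_date, d.sales_region
FROM fact_sales f
JOIN dim_city d
  ON  d.country = f.country AND d.state = f.state AND d.city = f.city
  AND f.sale_date >= d.effective_date AND f.sale_date < d.end_date
ORDER BY f.sale_date
"""

sk_rows = conn.execute(sk_join).fetchall()
nk_rows = conn.execute(nk_join).fetchall()
print(sk_rows)
print(nk_rows)
```

Both queries return the same rows. The difference is that the surrogate-key version pays for the version lookup once in ETL, while the natural-key version pays for it in every query.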

You can partition by effective_date for faster filtering and joining, because queries then read only the partitions for the effective dates they need. And what will a concatenated key like usavirginarichmond20110101 give you? Full scans, because filtering will have to be done on substr. So keep country, state, city and effective_date as separate key columns, and partition by effective_date.
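
As a rough illustration of the argument above, here is a toy Python model (not Hive itself) of the two layouts. The sample rows and the helper names `scan_concatenated` and `scan_partitioned` are hypothetical. In Hive the partitioned layout would correspond to a table declared with something like `PARTITIONED BY (effective_date STRING)`. The sketch only counts how many rows each layout has to touch to answer "give me everything effective on 2011-01-01".

```python
from collections import defaultdict

# Made-up sample dimension rows: (country, state, city, effective_date)
rows = [
    ("usa", "virginia", "richmond", "2011-01-01"),
    ("usa", "virginia", "richmond", "2012-01-01"),
    ("usa", "texas", "austin", "2011-01-01"),
]

# Layout A: one concatenated string key per row.
concatenated = ["".join(r).replace("-", "") for r in rows]

def scan_concatenated(keys, yyyymmdd):
    """The date is buried in the key, so every key must be inspected."""
    examined = 0
    hits = []
    for key in keys:
        examined += 1
        if key[-8:] == yyyymmdd:  # the substr-style filter
            hits.append(key)
    return hits, examined

# Layout B: rows grouped by effective_date, like partition directories.
partitions = defaultdict(list)
for country, state, city, effective_date in rows:
    partitions[effective_date].append((country, state, city))

def scan_partitioned(parts, effective_date):
    """Only the matching partition is read; the others are never touched."""
    hits = parts.get(effective_date, [])
    return hits, len(hits)

hits_a, examined_a = scan_concatenated(concatenated, "20110101")
hits_b, examined_b = scan_partitioned(partitions, "2011-01-01")
print(examined_a, examined_b)
```

On three rows the difference is trivial, but the concatenated layout always examines the whole table, while the partitioned layout examines only the rows for the requested date.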

And one more important point: a numeric key generated with row_number() in Hive is not a good solution, because its generation does not run in distributed mode (a global row_number() typically pushes all rows through a single reducer). It is better to use a GUID for this purpose.
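
A minimal Python sketch of why GUIDs suit distributed key generation: each worker can assign keys on its own, with no shared counter. The `assign_keys` helper and the sample rows are hypothetical. In Hive itself, a commonly used way to get the same effect is `reflect('java.util.UUID', 'randomUUID')`, though you should check what your Hive version supports.

```python
import uuid

def assign_keys(partition_rows):
    """Give each row a random UUID; no coordination with other workers is needed."""
    return [(str(uuid.uuid4()), row) for row in partition_rows]

# Two "workers" process separate slices of the data independently.
worker_a = assign_keys([("usa", "virginia", "richmond")])
worker_b = assign_keys([("usa", "texas", "austin"), ("usa", "ohio", "columbus")])

all_keys = [key for key, _ in worker_a + worker_b]
print(all_keys)
```

The trade-off is that a GUID is a 36-character string rather than a compact integer, so it gives up some of the size and join-speed benefit usually attributed to integer surrogate keys.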
