data-warehouse

Is it better to have a surrogate key or nk+effective_time in dimension tables in apache hive

笑着哭i submitted on 2019-12-08 00:27:44
Question: Let's say there is an SCD2 dimension table, location. The natural key is country, state and city combined. Since it is an SCD2 table, the effective date is also part of the key. Is it better to have the surrogate key as usavirginarichmond20110101 or to create an actual numerical key using row_number() in Hive? Why is one approach better than the other? Answer 1: (Note on terminology: a combination of natural keys is called a "composite key", not a surrogate key, and it is still a "natural key". Surrogate key (aka

How to model process and status history in a data warehouse?

旧城冷巷雨未停 submitted on 2019-12-07 07:32:51
Question: Let's say that we have D_PROCESS, D_WORKER and D_STATUS as dimensions, and the fact F_EVENT that links a process (what) with a worker (who's in charge) and the "current" status. The process status changes over time. Should we store in F_EVENT one line per process/status/worker, or one line per process/worker, and "somewhere else" one line per status change for a given process/worker? I'm new to data warehousing and it's hard to find best practices or tutorials related to data modeling. Answer 1: Read

Datawarehouse Tutorial [closed]

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-07 06:59:34
Question (closed 7 years ago): My boss has discovered a new magazine which mentioned data warehousing. Thus I am in search of a good tutorial or book on data

Insert into a star-schema

人走茶凉 submitted on 2019-12-07 06:11:16
Question: I've read a lot about star schemas, about fact/dimension tables, and about select statements to quickly report data; however, the matter of data entry into a star schema remains unclear to me. How does one "theoretically" enter data into a star-schema database while maintaining the fact table? Is a series of INSERT INTO statements within a giant stored procedure with 20 parameters my only option (and how do I populate the fact table)? Many thanks. Answer 1: Start with dimensions first -- one by one. Use ECCD (Extract, Clean,
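As a rough illustration of the dimensions-first loading order that the answer above begins to describe, here is a minimal sketch using Python's built-in `sqlite3` module. The table names `d_product` and `f_sales`, their columns, and the function `load_star_schema` are hypothetical and do not come from the original question or answer.

```python
import sqlite3


def load_star_schema(conn, sales):
    """Load (product_name, amount) pairs: dimension rows first, then facts.

    Table and column names are illustrative assumptions only.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS d_product ("
        "product_key INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS f_sales ("
        "product_key INTEGER REFERENCES d_product(product_key), amount REAL)"
    )
    # Step 1: make sure every product seen in the input has a dimension row.
    for name in sorted({name for name, _ in sales}):
        conn.execute("INSERT OR IGNORE INTO d_product (name) VALUES (?)", (name,))
    # Step 2: insert facts, resolving each product name to its surrogate key.
    for name, amount in sales:
        (key,) = conn.execute(
            "SELECT product_key FROM d_product WHERE name = ?", (name,)
        ).fetchone()
        conn.execute("INSERT INTO f_sales VALUES (?, ?)", (key, amount))
    conn.commit()
```

Because the dimension is populated before the fact table, every fact row can reference an existing surrogate key, and no single procedure with a long parameter list is needed.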

Merging tables with duplicate data

元气小坏坏 submitted on 2019-12-07 04:54:49
Question: For a SQL Server data warehouse, I need to match 2 tables containing roughly the same data. There is obviously more to it than this, so redefining the task is not an option :-) Given 2 tables, A and B:
Table A:
    id   | fid | type
    -------------------
    100  | 1   | cookies
    110  | 1   | muffins
    120  | 1   | muffins
Table B:
    id   | fid | type
    --------------------
    a220 | 1   | muffins
    b220 | 1   | muffins
When merged (apply secret IT here - SQL), it should become A_B:
    A_id | B_id | fid | type
    -------------------------

Handling nulls in Datawarehouse

烈酒焚心 submitted on 2019-12-07 03:09:32
Question: I'd like to ask for your input on the best practice for handling null or empty data values in data warehousing and SSIS/SSAS. I have several fact and dimension tables that contain null values in different rows. Specifics: 1) What is the best way to handle null date/time values? Should I make a 'default' row in my time or date dimensions and point SSIS to the default row when a null is found? 2) What is the best way to handle nulls/empty values inside of dimension

How to handle Bridge table in Star Schema

佐手、 submitted on 2019-12-06 12:34:55
Question: I am trying to build a star schema from an E/R diagram (OLTP system) that seems to contain a bridge table. Order is an obvious fact table and Product a dimension table. I can't see how I can keep the bridge table if the model needs to be a star schema. How would you tackle this relationship if I need to keep information about Channel in the model? Answer 1: It depends on how you plan to use the model. If you only need to answer product and channel questions about existing orders, then you can avoid the bridge table altogether, because M2M relations between channels and products can be resolved through

Returning empty rows in GROUP BY clause [duplicate]

不问归期 submitted on 2019-12-06 11:45:18
Question (marked as a duplicate of "MySQL GROUP BY and Fill Empty Rows", 2 answers; closed 5 years ago): I have a query to retrieve the total number of events per day between two dates.
    SELECT DATE_FORMAT(startTime, '%Y%m%d') as day, COUNT(0) as numEvents FROM events WHERE (startTime BETWEEN 20140105000000 AND 20140112235900) GROUP BY day;
This returns a list like
    day      | numEvents
    ---------+----------
    20140105 | 45
    20140107 | 79
    20140108 | 12
    20140109 | 56
Notice that there are missing days
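The excerpt above stops before any answer. One common way to get a row for every day, including days with no events, is to LEFT JOIN a calendar table to the events and count only the matched rows. The following is a minimal sketch of that idea using Python's built-in `sqlite3` rather than MySQL. The `calendar` table, the `startDay` column (events stored as `YYYYMMDD` strings instead of timestamps) and the function `daily_counts` are simplifying assumptions, not part of the original question.

```python
import sqlite3
from datetime import date, timedelta


def daily_counts(event_days, start, end):
    """Return (day, numEvents) for every day in [start, end], zero-filled.

    event_days: iterable of 'YYYYMMDD' strings, one per event (an assumption;
    the original table stores full timestamps in startTime).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (startDay TEXT)")
    conn.execute("CREATE TABLE calendar (day TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO events VALUES (?)", [(d,) for d in event_days])
    # One calendar row per day in the requested range.
    n_days = (end - start).days + 1
    conn.executemany(
        "INSERT INTO calendar VALUES (?)",
        [((start + timedelta(days=i)).strftime("%Y%m%d"),) for i in range(n_days)],
    )
    # COUNT(e.startDay) ignores NULLs, so days without events yield 0.
    rows = conn.execute(
        """
        SELECT c.day, COUNT(e.startDay) AS numEvents
        FROM calendar AS c
        LEFT JOIN events AS e ON e.startDay = c.day
        GROUP BY c.day
        ORDER BY c.day
        """
    ).fetchall()
    conn.close()
    return rows
```

The key design point is counting a column from the joined side (`COUNT(e.startDay)`) rather than `COUNT(*)`, since the latter would report 1 for days that only have the calendar row.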

What are the types of dimension tables in star schema design? [closed]

谁都会走 submitted on 2019-12-06 09:37:12
Question (closed 7 years ago): When reading about star schema design, I have seen that many people use various names for different types of dimension tables. Please

Is it better to have a surrogate key or nk+effective_time in dimension tables in apache hive

亡梦爱人 submitted on 2019-12-06 09:14:36
Question: Let's say there is an SCD2 dimension table, location. The natural key is country, state and city combined. Since it is an SCD2 table, the effective date is also part of the key. Is it better to have the surrogate key as usavirginarichmond20110101 or to create an actual numerical key using row_number() in Hive? Why is one approach better than the other? Answer 1: (Note on terminology: a combination of natural keys is called a "composite key", not a surrogate key, and it is still a "natural key". A surrogate key (aka synthetic key) is a sequential integer that has no business meaning.) Short answer: since your dimension is SCD2,
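To make the terminology in the question concrete, the sketch below assigns sequential integer surrogate keys to (natural key, effective date) rows. It mirrors the general idea of numbering new rows and offsetting by the current maximum key, but it is plain Python rather than Hive, and the function name `assign_surrogate_keys` and its parameters are hypothetical.

```python
from itertools import count


def assign_surrogate_keys(rows, max_existing_key=0):
    """Map each (country, state, city, eff_date) tuple to a sequential integer.

    Keys start after max_existing_key, so previously issued keys stay unique.
    The ordering (sorted tuples) is an arbitrary but deterministic assumption.
    """
    keys = count(max_existing_key + 1)
    return {row: next(keys) for row in sorted(set(rows))}
```

For example, two versions of the same city with different effective dates receive two distinct integers, while the composite natural key (country, state, city, effective date) remains available as ordinary columns.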