etl

ETL Operation - Return Primary Key

你说的曾经没有我的故事 submitted on 2019-12-23 03:17:24
Question: I am using Talend to populate a data warehouse. My job writes customer data to a dimension table and transaction data to the fact table. The surrogate key (p_key) on the fact table is auto-incrementing. When I insert a new customer, I need my fact table to reflect the id of the related customer. As I mentioned, my p_key is auto-incrementing, so I can't just insert an arbitrary value for the p_key. Any thoughts on how I can insert a row into my dimension table and still retrieve the
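The general pattern the question is after can be sketched outside Talend: insert into the dimension, let the database hand back the generated surrogate key, and use that key in the fact insert. A minimal stand-in using SQLite (the table and column names other than p_key are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (p_key INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
conn.execute("CREATE TABLE fact_txn (txn_id INTEGER PRIMARY KEY AUTOINCREMENT,"
             " customer_key INTEGER, amount REAL)")

# Insert the new customer; the cursor exposes the key the database generated.
cur = conn.execute("INSERT INTO dim_customer (name) VALUES (?)", ("Alice",))
customer_key = cur.lastrowid

# Use the retrieved surrogate key in the fact row.
conn.execute("INSERT INTO fact_txn (customer_key, amount) VALUES (?, ?)",
             (customer_key, 19.99))
row = conn.execute("SELECT customer_key FROM fact_txn").fetchone()
print(customer_key, row[0])
```

The same idea is expressed natively per database: on Postgres with `INSERT ... RETURNING p_key`, on SQL Server with an `OUTPUT inserted.p_key` clause or `SCOPE_IDENTITY()`.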

Database design for incremental “export” to data warehouse

谁说我不能喝 submitted on 2019-12-23 02:54:07
Question: Given a 1 TB relational database, currently in SQL Server. The data warehouse needs a "copy" of major parts of the database, and the warehouse data should not be more than 24 hours old. The size of the relational database makes it impractical to do a full load every night. How should I design my relational database to support incremental loads to the warehouse? A very small portion (<0.1%) of the database changes in a single day, and it is mostly inserts. The intra-day changes are not required,
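One common design for this is a watermark-based incremental extract: each source row carries a monotonically increasing change marker (SQL Server's `rowversion` column type, or an application-maintained last-modified value), and the nightly job pulls only rows past the last watermark it persisted. A minimal sketch of the idea, using SQLite and invented table names in place of the real schema:

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, change_seq INTEGER)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10.0, 101), (2, 20.0, 102), (3, 30.0, 103)])

# The watermark persisted by the previous warehouse load.
last_watermark = 101

# Pull only the rows whose change marker is past the watermark.
changed = src.execute(
    "SELECT id, total, change_seq FROM orders WHERE change_seq > ? ORDER BY change_seq",
    (last_watermark,),
).fetchall()

# Persist the highest marker seen, for the next run.
new_watermark = changed[-1][2] if changed else last_watermark
print(len(changed), new_watermark)
```

Since the daily change volume is tiny and mostly inserts, each nightly run then touches only the changed fraction instead of the full terabyte.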

Parametrized transformation from Pentaho DI server console

一笑奈何 submitted on 2019-12-22 12:27:50
Question: I can execute an independent scheduled transformation from the Pentaho DI Server console, but I have an issue running a parametrized scheduled transformation from the Pentaho DI Server console. How can I pass a parameter value at run time? In Pentaho BI Server, to execute a parametrized report we used to pass the variable value in the URL. I tried the same in Pentaho DI Server, as below, but it didn't work: http:// * * /pentaho-di/kettle/transStatus?name=UI_parameter&Values=Testvalue Source: https://stackoverflow.com/questions

Is Alteryx an ETL tool? How does it differ from SSIS? [closed]

一曲冷凌霜 submitted on 2019-12-22 08:59:16
Question: My client wants me to implement an ETL process using Alteryx, as they have a license for it. I am confused about whether Alteryx is an ETL tool or not. I believe Alteryx is commonly used to prepare data for the Tableau data visualization tool. Please advise: is it an ETL tool, and how does it differ from SSIS?

Unable to fetch “ReadWrite” Variable Value in Script Component of SSIS

微笑、不失礼 submitted on 2019-12-22 08:03:23
Question: In the Script Component [Input0_ProcessInputRow], I am trying to fetch a "ReadWrite" global variable's value, and it throws the error below. ERROR: The collection of variables locked for read and write access is not available outside of PostExecute. Below is my code:
If Row.Column13 = "C" Then
    Variables.mTotalCreditCount = Variables.mTotalCreditCount - 1
    Variables.mTotalCreditAmount = Variables.mTotalCreditAmount - CDbl(Row.Column14)
ElseIf Row.Column13 = "D" Then
    Variables.mTotalDebitCount = Variables
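The error text describes the documented SSIS behavior: ReadWrite package variables can only be assigned inside PostExecute. The usual workaround is to accumulate into private fields while processing rows and apply them to the variables once, in PostExecute. A sketch of that pattern (variable and column names copied from the question; the question's debit branch is truncated, so only the credit branch is shown):

```vb
' Accumulate per-row deltas in plain fields, which are always writable.
Private creditCountDelta As Integer
Private creditAmountDelta As Double

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    If Row.Column13 = "C" Then
        creditCountDelta -= 1
        creditAmountDelta -= CDbl(Row.Column14)
    End If
End Sub

Public Overrides Sub PostExecute()
    MyBase.PostExecute()
    ' The ReadWrite variables collection is only unlocked here.
    Variables.mTotalCreditCount = Variables.mTotalCreditCount + creditCountDelta
    Variables.mTotalCreditAmount = Variables.mTotalCreditAmount + creditAmountDelta
End Sub
```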

ETL (3): Using the Aggregator Transformation Component (Aggregation) Together with the Expression Component

不羁的心 submitted on 2019-12-22 04:47:45
1. The requirement is as follows.
2. Before starting the ETL development, first create an edw user.
3. Create a test_aggregation folder for this project. Note: each job is effectively a project, so creating a folder makes it easier to manage projects; and since an ETL development flow has many steps, keeping them in one folder is more convenient.
4. The ETL development flow is as follows. For the detailed steps of the whole flow, see my other article: https://blog.csdn.net/weixin_41261833/article/details/103625414
1) Define the source table
2) Define the target table
① Generate the target table from the source table;
② Double-click the table and rename the target table;
③ Filter the source table's columns, keeping or customizing the columns we want;
④ Only after generating and executing the SQL is the table structure created in the target database (this step is critical!);
⑤ Perform the following operations on the "generate database objects" table above;
⑥ After execution completes, you can see that the edw_ITEMS table has been generated under the edw user's tables;
3) Create the mapping
① Create a mapping;
② Drag both the source table and the target table into the gray area on the right;
③ Add an "Aggregator" component between the source table and the target table;
④ Double-click the "Aggregator" component and click "Ports" to bring up the initial screen below;
⑤ Group by customer id and take the maximum of price; other operations work the same way;
Note: for an explanation of the I and O at position 1 in the figure above, see my other article: https://blog.csdn.net/weixin
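The Aggregator configuration described above (group by customer id, take the maximum price) is equivalent to a plain SQL GROUP BY. A minimal stand-in using SQLite, with table and column names invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (customer_id INTEGER, price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, 5.0), (1, 9.0), (2, 3.0)])

# What the Aggregator component computes: MAX(price) per customer_id.
rows = conn.execute(
    "SELECT customer_id, MAX(price) FROM items GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(rows)
```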

Does pyodbc have any design advantages over pypyodbc?

戏子无情 submitted on 2019-12-22 01:53:35
Question: I know pyodbc is an older project and probably more featureful and robust, but is there anything about its design (based on components of compiled C code) that would make it preferable to a pure-Python implementation such as pypyodbc? I do a lot of ETL work and am thinking of switching from a Linux/Jython/JDBC approach to a Windows/Cygwin/Python/ODBC approach. Answer 1: Potential advantages of pyodbc over pypyodbc, by virtue of being written in C, would be: speed - see the pypyodbc wiki comparison more

Creating a real-time data warehouse

ε祈祈猫儿з submitted on 2019-12-21 21:28:21
Question: I am doing a personal project that consists of creating the full architecture of a data warehouse (DWH). In this case, as the ETL and BI analysis tool I decided to use Pentaho; it has a lot of functionality, from easy dashboard creation to full data mining processes and OLAP cubes. I have read that a data warehouse must be a relational database, and I understand this. What I don't understand is how to achieve a near real-time, or fully real-time, DWH. I have read about push and pull
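A hypothetical sketch of the pull-based variant: instead of one nightly batch, a small job polls the source on a short interval and copies only the rows that appeared since its last watermark (micro-batching). SQLite stands in for both the source and the warehouse, and all names are invented:

```python
import sqlite3

src = sqlite3.connect(":memory:")
dwh = sqlite3.connect(":memory:")
src.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
dwh.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

def micro_batch(last_id):
    """One polling cycle: copy rows newer than last_id into the warehouse."""
    rows = src.execute("SELECT id, payload FROM events WHERE id > ?",
                       (last_id,)).fetchall()
    dwh.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return rows[-1][0] if rows else last_id

src.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])
watermark = micro_batch(0)              # first cycle picks up rows 1 and 2
src.execute("INSERT INTO events VALUES (3, 'c')")
watermark = micro_batch(watermark)      # next cycle picks up only row 3
count = dwh.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count, watermark)
```

Shrinking the polling interval moves this from "24 hours old" toward near real time; fully real time generally means the source pushes changes (triggers, change data capture, or a message queue) instead of being polled.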

Informatica writes rejected rows into a bad file, how to avoid that?

自古美人都是妖i submitted on 2019-12-21 20:25:00
Question: I have developed an Informatica PowerDesigner 9.1 ETL job which uses a Lookup and an Update transform to detect whether the target table already has the incoming rows from the source. I have set the Update transform's condition to IIF(ISNULL(target_table_surrogate_id), DD_INSERT, DD_REJECT). Now, when the incoming row is already in the target table, the row is rejected, and Informatica writes these rejected rows into a .bad file. How can I prevent this? Is there a way to determine that the rejected rows

Writing JSON column to Postgres using Pandas .to_sql

廉价感情. submitted on 2019-12-21 07:49:07
Question: During an ETL process I needed to extract and load a JSON column from one Postgres database to another. We use pandas for this, since it has so many ways to read and write data from different sources/destinations, and all the transformations can be written using Python and pandas. We're quite happy with the approach, to be honest... but we hit a problem. Usually it's quite easy to read and write the data: you just use pandas.read_sql_table to read the data from the source and pandas.to_sql to
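The problem this question runs into is typically that the JSON column arrives as Python dicts, which database drivers cannot adapt directly. One common workaround is to serialize the column to a JSON string before calling to_sql; a minimal sketch (SQLite stands in for the target Postgres, and the frame's names are invented):

```python
import json
import sqlite3
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "meta": [{"a": 1}, {"b": 2}]})

# Serialize the dicts to JSON strings so the driver sees plain text.
df["meta"] = df["meta"].apply(json.dumps)

conn = sqlite3.connect(":memory:")
df.to_sql("events", conn, index=False)

# Reading it back, the JSON can be restored with json.loads.
raw = conn.execute("SELECT meta FROM events WHERE id = 1").fetchone()[0]
restored = json.loads(raw)
print(restored)
```

With an SQLAlchemy engine against Postgres, an alternative is to let the driver do the adaptation by passing `dtype={"meta": sqlalchemy.types.JSON}` to to_sql.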