etl

ETL model with DAGs and Tasks

拜拜、爱过 submitted on 2019-12-02 07:55:18
I'm trying to model my ETL jobs with Airflow. All jobs have roughly the same structure: extract from a transactional database (N extractions, each one reading 1/N of the table), then transform the data, and finally insert the data into an analytic database. So E >> T >> L. This company routine, USER >> PRODUCT >> ORDER, has to run every 2 hours; then I will have all the data from users and purchases. How can I model it? Must the company routine (USER >> PRODUCT >> ORDER) be a DAG, with each job as a separate Task? In that case, how can I model each step (E, T, L) inside the task and make them behave like
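One possible layout, sketched against Airflow 1.x (current when this was asked). The callables, task IDs, and schedule below are illustrative assumptions, and the N-way parallel extract is collapsed into a single extract task per table, so treat this as a starting point rather than the accepted answer:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

# Placeholder callables -- the real extract/transform/load logic would go here.
def extract(table, **kwargs):
    print(f"extracting {table}")

def transform(table, **kwargs):
    print(f"transforming {table}")

def load(table, **kwargs):
    print(f"loading {table}")

dag = DAG(
    dag_id="company_routine",
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 */2 * * *",  # every 2 hours
    catchup=False,
)

previous_load = None
for table in ["USER", "PRODUCT", "ORDER"]:
    e = PythonOperator(task_id=f"extract_{table}", python_callable=extract,
                       op_kwargs={"table": table}, dag=dag)
    t = PythonOperator(task_id=f"transform_{table}", python_callable=transform,
                       op_kwargs={"table": table}, dag=dag)
    l = PythonOperator(task_id=f"load_{table}", python_callable=load,
                       op_kwargs={"table": table}, dag=dag)
    e >> t >> l                       # E >> T >> L inside each job
    if previous_load is not None:
        previous_load >> e            # chain USER >> PRODUCT >> ORDER
    previous_load = l
```

Chaining the load of one table to the extract of the next reproduces the USER >> PRODUCT >> ORDER ordering inside a single DAG, with E, T and L as separate tasks per table.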

ETL: incremental extraction methods

时光怂恿深爱的人放手 submitted on 2019-12-02 07:08:19
1. Trigger-based approach

The trigger-based approach is a widely used incremental extraction mechanism. Based on the extraction requirements, three triggers (for insert, update and delete) are created on each source table to be extracted. Whenever data in a source table changes, the corresponding trigger writes the change to an incremental log table; the ETL incremental extraction then reads from the incremental log table instead of directly from the source table, and log entries that have already been extracted must be marked or deleted promptly. For simplicity, the incremental log table usually does not store every field of the changed data, only the source table name, the key value of the changed record, and the operation type (INSERT, UPDATE or DELETE). The ETL incremental extraction process first fetches the corresponding complete record from the source table using the source table name and key value, and then processes the target table according to the operation type.

For example, for an Oracle source database, the process of capturing incremental data with triggers is as follows. In this way, all DML operations on table T are recorded in the incremental log table DML_LOG. Note that the log table does not record the incremental data itself, only where it came from; during incremental ETL, the source table is queried back using the log table entries to obtain the actual incremental data.

SQL code
(1) Create the incremental log table DML_LOG:

CREATE TABLE DML_LOG(
  ID NUMBER PRIMARY KEY,      -- auto-increment primary key
  TABLE_NAME VARCHAR2(200),   -- source table name
  RECORD_ID NUMBER,           -- primary key value of the changed source row
  DML_TYPE CHAR(1
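The query-back step described above can be sketched in Python against a generic DB-API connection. This is only an illustration under assumptions: the DML_LOG columns defined above, DML_TYPE values of 'I', 'U' and 'D', Oracle-style named bind parameters (e.g. cx_Oracle), and processed log rows being deleted rather than marked:

```python
# Sketch of the incremental-extraction step: read the DML_LOG entries for one
# source table, then query the source table back to obtain the full changed rows.
def extract_incremental_rows(conn, source_table="T", key_column="ID"):
    changes = []
    cur = conn.cursor()
    cur.execute(
        "SELECT ID, RECORD_ID, DML_TYPE FROM DML_LOG "
        "WHERE TABLE_NAME = :t ORDER BY ID",
        {"t": source_table},
    )
    for log_id, record_id, dml_type in cur.fetchall():
        if dml_type == "D":
            # Deletes carry no row data; only the key value is applied to the target.
            changes.append((dml_type, record_id, None))
        else:
            row_cur = conn.cursor()
            row_cur.execute(
                f"SELECT * FROM {source_table} WHERE {key_column} = :id",
                {"id": record_id},
            )
            changes.append((dml_type, record_id, row_cur.fetchone()))
        # Processed log entries must not be re-extracted; here they are deleted.
        cur.execute("DELETE FROM DML_LOG WHERE ID = :id", {"id": log_id})
    conn.commit()
    return changes
```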

How to extract a subset from a CSV file using NiFi

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-02 06:54:19
Question: I have a CSV file, say with 100+ columns, and I want to extract only a specific 60 columns as a subset (both the column name and its value). I know we can use the ExtractText processor. Can anyone tell me what regular expression to write? For example, let's say from the given snapshot I only want NiFi to extract the 'BMS_sw_micro', 'BMU_Dbc_Dbg_Micro', 'BMU_Dbc_Fia_Micro' columns, i.e. extract only columns F, L, O. Any help is much appreciated!

Answer 1: As I said in the comment, you can count the number of commas before the
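To make the comma-counting idea concrete, here is a hedged sketch using Python's re module: skip a fixed number of fields, then capture the next one. The sample line and the column positions (F, L, O as 0-based positions 5, 11, 14) are assumptions for illustration; in NiFi, a pattern of this shape would typically be supplied as a dynamic property on the ExtractText processor, one per wanted column:

```python
import re

def column_pattern(position):
    # position = number of comma-separated fields to skip before the captured field
    return r"^(?:[^,]*,){%d}([^,]*)" % position

# Made-up sample line standing in for one CSV record.
line = "a,b,c,d,e,value_F,g,h,i,j,k,value_L,m,n,value_O,p"

for col, pos in {"F": 5, "L": 11, "O": 14}.items():
    match = re.match(column_pattern(pos), line)
    print(col, "->", match.group(1))
```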

How to pivot data using Informatica when you have a variable number of pivot rows?

冷暖自知 submitted on 2019-12-02 06:45:59
Question: Based on my earlier questions, how can I pivot data using Informatica PowerCenter Designer when I have a variable number of addresses in my data? I would like to pivot e.g. four addresses from my data. This is the structure of the source data file:

+---------+--------------+-----------------+
| ADDR_ID | NAME         | ADDRESS         |
+---------+--------------+-----------------+
| 1       | John Smith   | JohnsAddress1   |
| 1       | John Smith   | JohnsAddress2   |
| 1       | John Smith   | JohnsAddress3   |
| 2       | Adrian Smith |
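The actual solution would be built from PowerCenter transformations; purely to illustrate the target shape, here is a pandas sketch that numbers each person's addresses and spreads up to four of them into columns. The helper column SEQ and the single placeholder address for Adrian Smith are assumptions, not part of the question's data:

```python
import pandas as pd

# Rows follow the sample above; Adrian's address is a placeholder to complete the frame.
src = pd.DataFrame({
    "ADDR_ID": [1, 1, 1, 2],
    "NAME": ["John Smith", "John Smith", "John Smith", "Adrian Smith"],
    "ADDRESS": ["JohnsAddress1", "JohnsAddress2", "JohnsAddress3", "AdriansAddress1"],
})

# Number the addresses within each ADDR_ID, then spread them into ADDRESS1..ADDRESS4.
src["SEQ"] = src.groupby("ADDR_ID").cumcount() + 1
pivoted = (
    src[src["SEQ"] <= 4]
    .pivot_table(index=["ADDR_ID", "NAME"], columns="SEQ",
                 values="ADDRESS", aggfunc="first")
    .rename(columns=lambda n: f"ADDRESS{n}")
    .reset_index()
)
print(pivoted)
```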

Predicting data with a Python script in an SSIS package

a 夏天 submitted on 2019-12-02 06:26:17
Question: I'm aware of Microsoft's inclusion of Python in their Machine Learning Services for SQL Server; however, this is only available for SQL Server 2017 and up, a requirement my servers do not currently meet. That being the case, I wanted to deploy my generate-predictions-with-trained-model pipeline entirely within SSIS, i.e.: grab data from my DB, pass it to a Python script Data Flow Task which imports the trained model, generates the predictions and passes them on to the next Data
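One common pattern for the Python leg of such a pipeline is a standalone scoring script that SSIS launches (for example from an Execute Process Task), exchanging data through files. The sketch below is a hypothetical illustration of that pattern, not the asker's Data Flow setup; every path, column assumption, and the pickled-model format are assumptions:

```python
# Hypothetical scoring script: read the extracted rows from a CSV, load a previously
# trained (pickled) model, and write predictions back out for the next step to consume.
import pickle
import sys

import pandas as pd

def main(input_csv, model_path, output_csv):
    data = pd.read_csv(input_csv)            # assumes columns match the model's features
    with open(model_path, "rb") as f:
        model = pickle.load(f)                # e.g. a scikit-learn estimator
    data["prediction"] = model.predict(data)
    data.to_csv(output_csv, index=False)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```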

Zip a folder using SSIS

偶尔善良 submitted on 2019-12-02 06:16:27
I am trying to zip a folder in SSIS. There are 12 files in the source folder and I need to zip that folder. I can get the files to zip fine; my problem is the folders. I have to use WinZip to create the zipped packages. Can anyone point me to a good tutorial? I haven't been able to implement any of the samples that I have found. Thanks.

Adding a Script Task, you can use the ZipFile class (see the reference); you must add a reference to the System.IO.Compression.FileSystem assembly in the project (.NET Framework 4.5). You need to provide to the Script Task the folder to be zipped and the name of the
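For comparison only, the same folder-to-archive operation outside SSIS is a one-liner in Python's standard library; this is not the Script Task / ZipFile approach the answer describes, and the paths are placeholders:

```python
# Zip an entire folder with the standard library; paths are placeholders.
import shutil

shutil.make_archive(r"C:\output\my_folder", "zip", root_dir=r"C:\data\my_folder")
# -> creates C:\output\my_folder.zip containing everything under C:\data\my_folder
```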

How to call .dtsx file which has input parameters from a stored procedure?

陌路散爱 submitted on 2019-12-02 06:00:40
Question: How to call a .dtsx package file which has input parameters from a stored procedure? Stored Procedure #1 will pass the list of files to be exported to Excel as a comma-separated value in a variable. The input variable will be passed to the SSIS package to export the data to Excel. How to handle an SSIS package which has input parameters from a stored procedure call?

Answer 1: Using DtExec and xp_cmdshell. One way to do that is to run the DtExec utility from the file system using the xp_cmdshell utility inside
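To show the moving parts of that approach, here is a hedged sketch that builds the DtExec command line, pushing the file list into a package variable with /Set, and runs it with Python's subprocess; in the answer's setup the same command string would instead be handed to xp_cmdshell inside the stored procedure. The package path, variable name, and exact /Set property path are assumptions:

```python
# Illustration of the DtExec invocation; assumes dtexec is on the PATH of a Windows host.
import subprocess

package_path = r"C:\SSIS\ExportToExcel.dtsx"        # hypothetical package
file_list = "FileA.csv,FileB.csv,FileC.csv"         # hypothetical comma-separated value

cmd = [
    "dtexec",
    "/F", package_path,
    # /Set pushes a value into a package variable; the exact property path depends
    # on how the variable is defined in the package.
    "/Set", rf"\Package.Variables[User::FileList].Properties[Value];{file_list}",
]
subprocess.run(cmd, check=True)
```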

How to look for updated rows when using AWS Glue?

我的梦境 submitted on 2019-12-02 04:57:39
I'm trying to use Glue for ETL on data I'm moving from RDS to Redshift. As far as I am aware, Glue bookmarks only look for new rows using the specified primary key and do not track updated rows. However, the data I am working with tends to have rows updated frequently, and I am looking for a possible solution. I'm a bit new to PySpark, so if it is possible to do this in PySpark I'd highly appreciate some guidance or a point in the right direction. If there's a possible solution outside of Spark, I'd love to hear it as well.

You can use the query to find the updated records by filtering data
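One common shape of that idea, sketched in PySpark: push a predicate on a last-modified timestamp down to the RDS source so each run reads only rows inserted or updated since the previous run. The JDBC settings, the updated_at column, and the way the watermark is stored between runs are all assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

last_run_ts = "2019-12-01 00:00:00"  # in practice, persist this watermark between runs

# Parenthesized subquery pushed down to the source; table and column names are placeholders.
incremental_query = f"""
    (SELECT * FROM orders
     WHERE updated_at > '{last_run_ts}') AS incremental_orders
"""

updated_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://my-rds-host:5432/mydb")  # placeholder endpoint
    .option("dbtable", incremental_query)
    .option("user", "etl_user")
    .option("password", "***")
    .option("driver", "org.postgresql.Driver")
    .load()
)
# updated_rows can then be transformed and written to Redshift as in the rest of the job.
```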

How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column

时光总嘲笑我的痴心妄想 submitted on 2019-12-02 04:09:40
How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column? In this case:

Source table schema: Col1, Col2
Destination schema after the Glue job: Col1, Col2, Update_Date (current timestamp)

I'm not sure if there's a Glue-native way to do this with the DynamicFrame, but you can easily convert to a Spark DataFrame and then use the withColumn method. You will need to use the lit function to put literal values into a new column, as below.

from pyspark.sql.functions import lit
glue_df = glueContext.create_dynamic_frame.from_catalog(...)
spark_df =
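The snippet is cut off; a plausible completion, with placeholder catalog names and glueContext taken to be the job's existing GlueContext, might look like the following:

```python
# Hedged completion of the truncated snippet above; catalog names are placeholders.
from datetime import datetime

from pyspark.sql.functions import current_timestamp, lit

glue_df = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)
spark_df = glue_df.toDF()

# Following the answer's wording: lit() stamps every row with one literal value
# captured at job start; current_timestamp() is the column-function alternative.
run_ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
spark_df = spark_df.withColumn("Update_Date", lit(run_ts))
# spark_df = spark_df.withColumn("Update_Date", current_timestamp())
```

If the rest of the job expects a DynamicFrame rather than a DataFrame, the result can be converted back with DynamicFrame.fromDF(spark_df, glueContext, "with_update_date").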