etl

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

早过忘川 submitted on 2019-12-02 23:09:37
Background: I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP. I need to extract data from it on a semi-real-time basis (someone is bound to ask what semi-real-time means; the answer is as frequently as I reasonably can, but I will be pragmatic, so as a benchmark let's say we are hoping for every 15 min) and feed it into a data warehouse. How much data? At peak times we are talking approx 80-100k rows per minute hitting the OLTP side; off-peak this will drop significantly to 15-20k. The most frequently updated rows are ~64 bytes each, but there are various tables etc., so the data
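One common pattern for this kind of semi-real-time pull is timestamp-based incremental extraction. Below is a minimal Python sketch of the idea, assuming a hypothetical orders table with a last_updated column and the psycopg2 driver; the table, column, and connection details are placeholders, not taken from the question.

import psycopg2  # assumes the psycopg2 driver is available

BATCH_SQL = """
    SELECT id, payload, last_updated
    FROM   orders                -- hypothetical OLTP table
    WHERE  last_updated > %s     -- only rows changed since the last run
    ORDER  BY last_updated
"""

def extract_since(conn, high_watermark):
    """Pull rows modified after the previous run's high-watermark."""
    with conn.cursor() as cur:
        cur.execute(BATCH_SQL, (high_watermark,))
        rows = cur.fetchall()
    # Persist the new watermark somewhere durable (file or control table)
    # so the next 15-minute run only picks up newer changes.
    new_watermark = rows[-1][2] if rows else high_watermark
    return rows, new_watermark

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=oltp user=etl")  # placeholder DSN
    rows, wm = extract_since(conn, "2019-12-01 00:00:00")
    print(len(rows), "changed rows; new watermark:", wm)

On 8.3 there is no logical decoding, so the main alternative for tables without a reliable updated timestamp is trigger-based change capture, with triggers writing changed keys into a queue table that the extractor drains.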

Installing PostgreSQL and migrating data from MySQL to PG

Anonymous (unverified) submitted on 2019-12-02 22:02:20
PG operations guide: http://www.ruanyifeng.com/blog/2013/12/getting_started_with_postgresql.html
psql commands: https://ywnz.com/linux/psql/
The production SQL has already been developed and the test database environment has been set up.
ODS table structures: imported from the source database. Intermediate table structures: imported from the production PG database.
Keep the source tables and the warehouse intermediate tables needed for the user tags; the bottom of this document lists the source and intermediate tables related to the user tags.
Import table data only, extract in full, and truncate the data before extraction:
supress_data: false
supress_ddl: true
force_truncate: true
All tables owned by this development task are replaced with their officially named counterparts: tag_user_temp, tag_member_temp, tag_user.
Intermediate tables: f_order_item, d_user, f_user_list have been adjusted (source tables unrelated to the user-tag values were removed); the ETL scripts for the remaining intermediate tables are identical to production.
The ETL scripts need to be exported separately from git_lab:bi_etl_dev/etl to the test server.
tag_user table: 1. member-attribute tags are refreshed in full; 2. ordinary-user tags must be updated incrementally. All other intermediate tables are refreshed in full.
Test scheduling script mysql2udw_dev.sh: /data/scripts/mysql2udw_dev.sh
Source data extraction config file (yaml): /home
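The supress_data / supress_ddl / force_truncate switches look like options from a py-mysql2pgsql-style YAML config (an assumption; the notes do not name the tool). As a sketch, the three flags from the notes could be rendered like this, with hostnames, credentials, and database names as pure placeholders:

import yaml  # assumes PyYAML is installed

config = {
    "mysql": {"hostname": "localhost", "port": 3306,
              "username": "etl", "password": "***", "database": "source_db"},
    "destination": {"postgres": {"hostname": "localhost", "port": 5432,
                                 "username": "etl", "password": "***",
                                 "database": "udw_test"}},
    "supress_data": False,   # False: table data is copied
    "supress_ddl": True,     # True: table structures are not recreated
    "force_truncate": True,  # True: target tables are truncated before loading
}
print(yaml.safe_dump(config, default_flow_style=False))

This matches the notes' intent of "data only, full extraction, truncate before loading"; the surrounding layout would follow whatever tool is actually driving the extraction.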

What should the converted data type of the corresponding column within the Data Converter SSIS Data Flow Component be?

馋奶兔 submitted on 2019-12-02 20:36:56
Question: We have plain Microsoft SQL Server 2008 on one of our servers. We decided to create DTSX files on the filesystem so that we can use BIDS 2008 to open them. One SSIS Control Flow component takes data from around 18-19 columns of a Microsoft SQL Server 2008 SELECT query and then converts the values in order to place them in a Microsoft Access table. I have a number of columns that I retrieve from a Microsoft SQL Server 2008 table using a Data Flow component called OLE DB

ETL model with DAGs and Tasks

帅比萌擦擦* submitted on 2019-12-02 18:15:19
Question: I'm trying to model my ETL jobs with Airflow. All jobs have roughly the same structure: extract from a transactional database (N extractions, each one reading 1/N of the table), then transform the data, and finally insert the data into an analytical database, so E >> T >> L. This company routine, USER >> PRODUCT >> ORDER, has to run every 2 hours; then I will have all the data from users and purchases. How can I model it? The company routine (USER >> PRODUCT >> ORDER) must be a DAG and each job must be a
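A minimal sketch of one way to lay this out as a single Airflow DAG (1.x-style imports), with placeholder extract/transform/load callables and made-up task ids; whether each entity should instead become its own DAG or sub-DAG is exactly the modelling question being asked:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

def extract(**kwargs):    # placeholder: read 1/N of the source table
    pass

def transform(**kwargs):  # placeholder: transform the extracted rows
    pass

def load(**kwargs):       # placeholder: insert into the analytical database
    pass

default_args = {"owner": "etl", "start_date": datetime(2019, 12, 1)}

with DAG("company_routine",
         default_args=default_args,
         schedule_interval=timedelta(hours=2),   # run every 2 hours
         catchup=False) as dag:
    previous_load = None
    for entity in ("user", "product", "order"):
        e = PythonOperator(task_id="extract_%s" % entity, python_callable=extract)
        t = PythonOperator(task_id="transform_%s" % entity, python_callable=transform)
        ld = PythonOperator(task_id="load_%s" % entity, python_callable=load)
        e >> t >> ld                   # E >> T >> L inside each job
        if previous_load is not None:
            previous_load >> e         # USER >> PRODUCT >> ORDER between jobs
        previous_load = ld

The obvious variations are splitting each extract into N parallel tasks (one per table slice) and promoting each entity to its own DAG that triggers the next one.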

DAG(directed acyclic graph) dynamic job scheduler

无人久伴 submitted on 2019-12-02 14:08:46
I need to manage a large workflow of ETL tasks whose execution depends on time, data availability, or an external event. Some jobs may fail during execution of the workflow, and the system should be able to restart a failed workflow branch without waiting for the whole workflow to finish executing. Are there any frameworks in Python that can handle this? I see several core functions: DAG building; execution of nodes (run a shell command with wait, logging, etc.); the ability to rebuild a sub-graph in the parent DAG during execution; the ability to manually execute nodes or a sub-graph while the parent graph is running.
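Airflow and Luigi are the usual Python answers here. Independently of any framework, the first two requirements (building the DAG and executing nodes as shell commands with wait and logging) can be illustrated with a small framework-free sketch; the task names and commands below are made up:

import logging
import subprocess

logging.basicConfig(level=logging.INFO)

# DAG as an adjacency map: task -> list of tasks it depends on
dag = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
}
commands = {
    "extract": "echo extracting",
    "transform": "echo transforming",
    "load": "echo loading",
}
done = set()

def run(task):
    """Run unfinished dependencies first, then the task's own shell command."""
    for dep in dag[task]:
        if dep not in done:
            run(dep)
    logging.info("running %s", task)
    subprocess.run(commands[task], shell=True, check=True)  # wait + raise on failure
    done.add(task)

for task in dag:            # execute every node once, in dependency order
    if task not in done:
        run(task)

Restarting a failed branch then amounts to re-seeding done with the tasks that already succeeded, which is the bookkeeping a real scheduler does for you.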

Split a Value in a Column with Right Function in SSIS

不羁的心 submitted on 2019-12-02 13:00:30
I need urgent help from you guys. The thing is, I have a column which represents the full name of a user, and I want to split it into first and last name. The format of the full name is "World, hello": the first name here is hello and the last name is World. I am using a Derived Column (SSIS), with the RIGHT function for the first name and the SUBSTRING function for the last name, but the result of these seems to be blank, which is where even I am blank. :)
It's working for me. In general, you should provide more detail in your questions on places such as this to help others recreate and troubleshoot your issue.
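Not SSIS expression syntax, but the intended logic is easy to verify first in a couple of lines of Python ("World, hello" should yield last name World and first name hello):

full_name = "World, hello"       # format: "Last, First"
last_name, first_name = (part.strip() for part in full_name.split(",", 1))
print(first_name, last_name)     # -> hello World

In the Derived Column the same idea is built around FINDSTRING locating the comma, with SUBSTRING taking the part before it and RIGHT (or another SUBSTRING) taking the part after it; a blank result usually means an off-by-one in the start index or length.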

How do I populate a relational multi-table MySQL database from an existing one-table database?

故事扮演 submitted on 2019-12-02 11:17:34
Basically I have many huge delimited files that I know I can import as a table, but I need to map that data to an existing relational multi-table MySQL database. There should not be any conflict with datatypes, but I'm super new to this, so please point out anything I should be watching for. Clearly I'm not going to run this in production either until I know it works. I'm not 100% sure Stack Overflow is the right place to ask a database question, but I couldn't find any other Stack Exchange that was a better fit. I posted this question on SuperUser looking for a GUI to do this, but I'm up for coding this
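One common approach is to bulk-load the delimited files into a single staging table first and then fan the rows out with INSERT ... SELECT statements. A rough Python sketch using mysql-connector-python, with a hypothetical staging table and made-up target tables customers and orders:

import mysql.connector  # assumes mysql-connector-python is installed

conn = mysql.connector.connect(host="localhost", user="etl",
                               password="***", database="shop")
cur = conn.cursor()

# 1. Fill the parent table with de-duplicated rows from the staging table.
cur.execute("""
    INSERT IGNORE INTO customers (email, name)
    SELECT DISTINCT email, name FROM staging
""")

# 2. Fill the child table, joining back to pick up the generated keys.
cur.execute("""
    INSERT INTO orders (customer_id, order_date, amount)
    SELECT c.id, s.order_date, s.amount
    FROM staging s
    JOIN customers c ON c.email = s.email
""")

conn.commit()
cur.close()
conn.close()

The datatype mapping is decided by the staging table's DDL; loading everything as text there and casting in the SELECTs is the more forgiving variant while you are still validating the data.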

How to extract a subset from a CSV file using NiFi

為{幸葍}努か submitted on 2019-12-02 09:02:33
I have a CSV file, say with 100+ columns, and I want to extract only 60 specific columns as a subset (both the column name and its value). I know we can use the ExtractText processor. Can anyone tell me what regular expression to write? For example, let's say from the given snapshot I only want NiFi to extract the 'BMS_sw_micro', 'BMU_Dbc_Dbg_Micro', and 'BMU_Dbc_Fia_Micro' columns, i.e. extract only columns F, L, O. Any help is much appreciated!
As I said in the comment, you can count the number of commas before the text you want to match and use that in the regex, like this: /(?<=^([^,]+?,){5})[^,]+/ What the RegEx do
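The lookbehind in that expression skips the first five comma-separated fields, so it captures column F (the sixth). Outside NiFi, the column-subset idea itself is easy to check in plain Python with the csv module; the header names and the two-line sample below are stand-ins for the snapshot:

import csv
import io

# Two-line stand-in for the real file: a header row and one data row.
sample = ("a,b,c,d,e,BMS_sw_micro,g,h,i,j,k,BMU_Dbc_Dbg_Micro,m,n,BMU_Dbc_Fia_Micro\n"
          "1,2,3,4,5,v1,7,8,9,10,11,v2,13,14,v3\n")
wanted = ["BMS_sw_micro", "BMU_Dbc_Dbg_Micro", "BMU_Dbc_Fia_Micro"]

for row in csv.DictReader(io.StringIO(sample)):
    subset = {name: row[name] for name in wanted}  # keep only the wanted columns
    print(subset)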

A Brief Overview of ETL and ELT

梦想与她 submitted on 2019-12-02 08:06:07
ETL
ETL is short for Extract-Transform-Load and describes the process of moving data from a source system to a target by extracting it, transforming it, and loading it. The term is most commonly used in the context of data warehouses, but it is not limited to them. ETL is a key step in building a data warehouse: the required data is extracted from the data sources, cleansed, and finally loaded into the warehouse according to a pre-defined data-warehouse model.
During the transform stage, ETL mainly covers the following aspects:
Null handling: null field values can be captured and either loaded as-is or replaced with data carrying another meaning, and rows can be routed to different target databases depending on whether a field is null.
Normalizing data formats: format constraints can be defined per field, and custom load formats can be specified for time, numeric, character, and other data in the source.
Splitting data: fields can be decomposed according to business requirements. For example, the calling number 861082585313-8148 can be split into an area code and a phone number.
Validating data correctness: the Lookup and splitting features can be used for validation. For example, after the calling number 861082585313-8148 is split into area code and phone number, a Lookup can return the caller's region as recorded by the originating gateway or switch, and the data can be checked against it.
Data replacement: invalid or missing data can be replaced where business factors require it.
Lookup: recovering lost data. A Lookup performs a sub-query and returns missing fields obtained by other means, ensuring field completeness.
Establishing primary/foreign-key constraints for the ETL process: illegal data with no dependencies can be replaced or exported to an error-data file.
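As a toy Python illustration of two of the transform steps above, null replacement and splitting one field into several (the caller string is borrowed from the 861082585313-8148 example; splitting at the hyphen and the field names are arbitrary choices for the sketch):

raw_rows = [
    {"caller": "861082585313-8148", "region": None},
    {"caller": "861082585313-8149", "region": "Beijing"},
]

cleaned = []
for row in raw_rows:
    number, extension = row["caller"].split("-", 1)        # split one field into two
    region = row["region"] if row["region"] is not None else "UNKNOWN"  # null replacement
    cleaned.append({"number": number, "extension": extension, "region": region})

print(cleaned)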