ETL

Java ETL process

十年热恋 Submitted on 2019-12-04 21:53:18
I have a new challenge: load ~100M rows from an Oracle database and insert them into a remote MySQL database server. I've divided the problem into two parts: a server-side REST service responsible for loading data into the MySQL server, and a client-side Java app responsible for loading from the Oracle data source. On the Java side I've used plain JDBC to load paginated content and transfer it over the wire to the server. This approach works, but it makes the code cumbersome and not very scalable, as I'm doing the pagination myself using Oracle's ROWNUM ... WHERE ROWNUM > x and ROWNUM < y. I've
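For context, here is a minimal sketch of the ROWNUM windowing described above, using plain JDBC. The connection string, table, and column names are assumptions, and in the real job each page would be posted to the REST endpoint rather than just counted:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class OraclePageReader {
        // The classic ROWNUM window: the inner query orders the rows, the middle
        // query numbers them, and the outer query keeps exactly one page.
        private static final String PAGE_SQL =
            "SELECT id, payload FROM ("
          + "  SELECT t.*, ROWNUM rn FROM ("
          + "    SELECT id, payload FROM source_table ORDER BY id"
          + "  ) t WHERE ROWNUM <= ?"
          + ") WHERE rn > ?";

        public static void main(String[] args) throws Exception {
            int pageSize = 10_000;
            try (Connection con = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password")) {
                for (long offset = 0; ; offset += pageSize) {
                    int rows = 0;
                    try (PreparedStatement ps = con.prepareStatement(PAGE_SQL)) {
                        ps.setLong(1, offset + pageSize);   // upper ROWNUM bound
                        ps.setLong(2, offset);              // lower bound, exclusive
                        try (ResultSet rs = ps.executeQuery()) {
                            while (rs.next()) {
                                rows++;
                                // here each row (or a batch of rows) would be POSTed to the REST endpoint
                            }
                        }
                    }
                    if (rows == 0) break;                   // past the last page
                }
            }
        }
    }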

Open-source ETL tools: the Talend series

醉酒当歌 Submitted on 2019-12-04 19:57:03
Talend Open Studio: Talend's flagship product, Talend Open Studio, offers the most open, most effective, and most innovative data integration solution on the market to date. With an all-in-one, ready-to-use platform, Talend Open Studio can satisfy the data integration requirements of any organization, regardless of its level of integration expertise or the size of the project. Talend Open Studio carries its strengths consistently through even the most complex data integration processes and holds up in the most demanding environments. Talend Integration Suite: Talend Integration Suite is an industry-leading open-source enterprise data integration solution that not only meets the most stringent enterprise requirements but can also handle integration tasks involving the largest data volumes and the most complex processes. Talend Integration Suite is offered as a subscription tailored to your needs, extending the capabilities of Talend's award-winning Talend Open Studio with professional-grade technical support and complementary features that support larger team collaboration and industrialize enterprise-scale deployments. Talend Integration Suite MPx: Built on Talend's award-winning enterprise data integration technology, Talend Integration Suite MPx is highly scalable

Row level atomic MERGE REPLACE in BigQuery

独自空忆成欢 Submitted on 2019-12-04 19:08:53
For my use case I'm working with data identifiable by a unique key at the source, exploded into n (a non-deterministic number of) target entries loaded into BigQuery tables for analytic purposes. Building this ETL on Mongo's recent Change Stream feature, I would like to drop all existing entries in BigQuery and then load the new entries atomically. Exploring BigQuery DML I see a MERGE operation is supported, but only WHEN MATCHED THEN DELETE or WHEN MATCHED THEN UPDATE is possible. I'm interested in a WHEN MATCHED THEN DELETE followed by an INSERT operation. How would I implement such an ETL in BigQuery
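One commonly cited way to get a delete-then-insert effect in a single atomic statement is a MERGE whose join condition is always false, so every source row falls into a NOT MATCHED branch (insert) and the affected target rows fall into NOT MATCHED BY SOURCE (delete). This is only a hedged sketch with made-up dataset, table, and key names, run here through the google-cloud-bigquery Java client; the exact clause support should be verified against the current BigQuery DML documentation:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.QueryJobConfiguration;

    public class ReplaceExplodedRows {
        public static void main(String[] args) throws Exception {
            // Hypothetical names: `staging` holds the freshly exploded rows for the
            // changed keys, `target` is the analytic table being replaced into.
            String sql =
                "MERGE `mydataset.target` T\n"
              + "USING `mydataset.staging` S\n"
              + "ON FALSE\n"                                   // never matches, so every row hits a NOT MATCHED branch
              + "WHEN NOT MATCHED BY SOURCE\n"
              + "  AND T.source_key IN (SELECT source_key FROM `mydataset.staging`)\n"
              + "  THEN DELETE\n"                              // drop the old rows for the affected keys
              + "WHEN NOT MATCHED THEN INSERT ROW";            // load every staging row

            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
            bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
        }
    }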

Data warehouse ETL case study (1)

删除回忆录丶 Submitted on 2019-12-04 16:29:07
From a course case study. The information management system of a multinational food supermarket chain records tens of thousands of sales records from its local stores every day. Against this big-data backdrop, the company's management decided to build a FoodMart data warehouse, hoping to mine commercially valuable information from the vast amount of data to further support management decisions. Design a sales data warehouse. Requirements: 1. at least 4 dimensions, each with at least 3 attributes, including dimension hierarchies where possible; 2. at least 1 fact table; 3. the data source must be obtainable (the designed dimension and measure fields should be available, directly or indirectly, in the data source). SQL Server Integration Services (SSIS) is used below. (1) Design the conceptual model of the data warehouse, as follows. (2) Set up the data warehouse data source and data views, and load the dimension tables: 1. create a project; 2. import the data set into SQL Server (here the Access database is first migrated to SQL Server and then used from SSIS; the Access driver could also be used directly in SSIS); 3. create a connection to SQL Server; 4. load the product, customer, date, store, and promotion dimensions in turn. The product dimension involves two tables, product and product_class; the desired data is obtained through a generated query, and in the destination editor a new table is created to store it in the data warehouse. The merchandise, promotion, customer, and time dimensions are loaded in the same way. For the time dimension the date string has to be split, using a derived column and date functions to build separate year, month, and day columns (note: the time dimension does not actually have to be built by hand; the data warehouse provides a template for building a time dimension
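For illustration only, a minimal Java sketch of what that derived-column split produces; the input date format is an assumption, and inside SSIS the same result typically comes from expression functions such as YEAR, MONTH, and DAY on the converted column:

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    public class DateSplit {
        public static void main(String[] args) {
            // Assumed input format; the actual FoodMart date strings may differ.
            DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd");
            LocalDate d = LocalDate.parse("1997-03-15", fmt);

            int year  = d.getYear();        // 1997 -> year attribute of the time dimension
            int month = d.getMonthValue();  // 3    -> month attribute
            int day   = d.getDayOfMonth();  // 15   -> day attribute
            System.out.printf("%d-%d-%d%n", year, month, day);
        }
    }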

Informatica writes rejected rows into a bad file, how to avoid that?

吃可爱长大的小学妹 Submitted on 2019-12-04 14:43:57
I have developed an Informatica PowerDesigner 9.1 ETL job which uses a Lookup and an Update Strategy transformation to detect whether the target table already has the incoming rows from the source. On the Update Strategy transformation I have set the condition IIF(ISNULL(target_table_surrogate_id), DD_INSERT, DD_REJECT). Now, when the incoming row is already in the target table, the row is rejected. Informatica writes these rejected rows into a .bad file. How can I prevent this? Is there a way to ensure that the rejected rows are not written into a .bad file? Or should I use e.g. a router instead of an update transform to

How to ETL multiple files using Scriptella?

可紊 Submitted on 2019-12-04 14:16:12
I have multiple log files (1.csv, 2.csv and 3.csv) generated by a log report. I want to read and parse those files concurrently using Scriptella. Scriptella does not provide parallel job execution out of the box. Instead you should use a job scheduler provided by the operating system or the programming environment (e.g. run multiple ETL files by submitting jobs to an ExecutorService). Here is a working example that imports a single file specified as a system property. ETL file: <!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd"> <etl> <connection id="in" driver="csv" url="
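A hedged sketch of that ExecutorService approach, assuming Scriptella's EtlExecutor.newExecutor(URL).execute() API and one ETL file per CSV (the file names are made up; check the EtlExecutor signatures against the Scriptella version you use):

    import java.io.File;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import scriptella.execution.EtlExecutor;

    public class ParallelImport {
        public static void main(String[] args) throws Exception {
            // One ETL file per log file to import; names are hypothetical.
            List<File> etlFiles = List.of(
                    new File("import1.etl.xml"),
                    new File("import2.etl.xml"),
                    new File("import3.etl.xml"));

            ExecutorService pool = Executors.newFixedThreadPool(etlFiles.size());
            for (File f : etlFiles) {
                // Each task builds its own executor and runs one ETL file independently.
                pool.submit(() -> EtlExecutor.newExecutor(f.toURI().toURL()).execute());
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);   // wait for all imports to finish
        }
    }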

How to determine if a record in every source represents the same person

与世无争的帅哥 Submitted on 2019-12-04 13:18:13
I have several sources of tables with personal data, like this: SOURCE 1: ID, FIRST_NAME, LAST_NAME, FIELD1, ... 1, jhon, gates ... SOURCE 2: ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ... 1, jon, gate ... SOURCE 3: ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ... 2, jhon, ballmer ... So, assuming that the records with ID 1 from sources 1 and 2 are the same person, my problem is how to determine whether a record in every source represents the same person. Additionally, not every record necessarily exists in all sources. The names are written mainly in Spanish. In this case, the exact matching needs to
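As a starting point, here is a hedged sketch of one common approach: normalize the names (lowercase, strip the accents common in Spanish) and compare them with an edit-distance similarity, treating pairs above a threshold as candidates for the same person. The field choice and the threshold are assumptions, and this is a building block rather than a complete record-linkage solution:

    import java.text.Normalizer;

    public class NameMatcher {
        /** Lowercase and strip diacritics so "José" and "jose" compare equal. */
        static String normalize(String s) {
            String lower = s == null ? "" : s.trim().toLowerCase();
            return Normalizer.normalize(lower, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        }

        /** Classic Levenshtein edit distance. */
        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        /** Similarity in [0,1]: 1.0 means identical after normalization. */
        static double similarity(String a, String b) {
            String x = normalize(a), y = normalize(b);
            int max = Math.max(x.length(), y.length());
            return max == 0 ? 1.0 : 1.0 - (double) editDistance(x, y) / max;
        }

        public static void main(String[] args) {
            // "jhon gates" (source 1) vs "jon gate" (source 2): likely the same person.
            double score = similarity("jhon gates", "jon gate");
            boolean samePerson = score >= 0.8;   // threshold is an assumption to tune on real data
            System.out.printf("score=%.2f samePerson=%b%n", score, samePerson);
        }
    }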

SSIS: Flat File default length

烈酒焚心 Submitted on 2019-12-04 11:46:23
I have to import about 50 different types of files every day. Some of them have a few columns, some include up to 250 columns. The Flat File connection always defaults all columns to 50 characters. Some columns can be much longer than 50 characters and will of course end up in errors. Currently I am doing a crude search & replace with notepad++: opening all SSIS packages and replacing DTS:MaximumWidth="50" with DTS:MaximumWidth="500". This is an annoying workaround. Is there any way to set a default length for flat-file string columns to a certain value? I am developing in Microsoft Visual Studio
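At minimum, the manual notepad++ step can be automated. Below is a small sketch that rewrites every .dtsx package under a folder; the folder path and the 500-character target width are assumptions, and the packages should be under version control before running anything like this:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public class WidenFlatFileColumns {
        public static void main(String[] args) throws IOException {
            Path packageDir = Path.of("C:/projects/etl/packages");   // hypothetical folder holding the .dtsx files
            try (Stream<Path> files = Files.walk(packageDir)) {
                files.filter(p -> p.toString().endsWith(".dtsx")).forEach(p -> {
                    try {
                        // .dtsx packages are plain XML, so the notepad++ replacement can run verbatim
                        String xml = Files.readString(p);
                        Files.writeString(p, xml.replace("DTS:MaximumWidth=\"50\"", "DTS:MaximumWidth=\"500\""));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
    }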

Split a Value in a Column with Right Function in SSIS

和自甴很熟 Submitted on 2019-12-04 06:59:22
Question: I need urgent help: I have a column that represents the full name of a user, and I want to split it into a first and a last name. The full name is formatted as "World, hello"; here the first name is hello and the last name is World. I am using a Derived Column (SSIS) with the RIGHT function for the first name and the SUBSTRING function for the last name, but the results of these come out blank, which is where even I am blank. :) Answer 1: It's working for me. In general, you should
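The underlying split is simple: everything before the comma is the last name and everything after it is the first name. A minimal sketch of that logic, in Java purely for illustration; inside the Derived Column the same thing is usually built from FINDSTRING and SUBSTRING, since RIGHT needs the computed length of the part after the comma:

    public class SplitFullName {
        public static void main(String[] args) {
            String fullName = "World, hello";          // format from the question: "LastName, FirstName"
            int comma = fullName.indexOf(',');

            String lastName  = fullName.substring(0, comma).trim();   // "World"
            String firstName = fullName.substring(comma + 1).trim();  // "hello"

            System.out.println(firstName + " / " + lastName);
        }
    }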

How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column

℡╲_俬逩灬. Submitted on 2019-12-04 05:08:13
Question: How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column? In this case: schema of the source table: Col1, Col2. After the Glue job, schema of the destination: Col1, Col2, Update_Date (current timestamp). Answer 1: I'm not sure if there's a Glue-native way to do this with the DynamicFrame, but you can easily convert to a Spark DataFrame and then use the withColumn method. You will need to use the lit function to put literal values into a new column, as below. from
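Glue scripts are normally written in Python or Scala rather than Java, but the DataFrame call the answer describes looks the same across the Spark APIs. Here is a hedged sketch using Spark's Java API with made-up input and output paths; the Glue-specific DynamicFrame-to-DataFrame conversion is omitted:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import static org.apache.spark.sql.functions.current_timestamp;

    public class AddUpdateDate {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("add-update-date").getOrCreate();

            // Hypothetical source with columns Col1, Col2.
            Dataset<Row> source = spark.read().parquet("s3://my-bucket/input/");

            // Add the extra column; current_timestamp() resolves to the query start time,
            // so every row gets the same value. (functions.lit(...) would be used instead
            // for a fixed literal value, as the answer above mentions.)
            Dataset<Row> withTs = source.withColumn("update_date", current_timestamp());

            withTs.write().parquet("s3://my-bucket/output/");
            spark.stop();
        }
    }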