etl

Query a database based on result of query from another database

假如想象 posted on 2019-11-29 06:43:31
I am using SSIS in VS 2013. I need to get a list of IDs from one database and, with that list of IDs, query another database, i.e. SELECT ... from MySecondDB WHERE ID IN ({list of IDs from MyFirstDB}). There are three methods to achieve this: 1st method - Using Lookup Transformation First, add a Lookup Transformation as @TheEsisia answered, but there are more requirements: in the Lookup you have to write the query that returns the ID list (e.g. SELECT ID From MyFirstDB WHERE ...), and you have to select at least one column from the lookup table. These will not filter rows, but this
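The underlying idea (read the IDs from one database, then use them in an IN clause against another) can be sketched outside SSIS. This is a minimal illustration in Python with two in-memory SQLite databases, not the SSIS Lookup approach described in the answer; the table names source_ids and items and the helper functions are hypothetical.

```python
import sqlite3


def fetch_ids(first_db):
    """Read the list of IDs from the first database (hypothetical table source_ids)."""
    rows = first_db.execute("SELECT id FROM source_ids ORDER BY id").fetchall()
    return [row[0] for row in rows]


def query_by_ids(second_db, ids):
    """Query the second database for rows whose id is in the given list."""
    if not ids:
        return []  # an empty IN () list is not valid SQL, so return early
    # One "?" placeholder per ID keeps the query parameterized.
    placeholders = ", ".join("?" for _ in ids)
    sql = f"SELECT id, name FROM items WHERE id IN ({placeholders}) ORDER BY id"
    return second_db.execute(sql, ids).fetchall()


# Demo data: two separate in-memory databases stand in for MyFirstDB and MySecondDB.
first_db = sqlite3.connect(":memory:")
first_db.execute("CREATE TABLE source_ids (id INTEGER PRIMARY KEY)")
first_db.executemany("INSERT INTO source_ids VALUES (?)", [(1,), (3,)])

second_db = sqlite3.connect(":memory:")
second_db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
second_db.executemany(
    "INSERT INTO items VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)

result = query_by_ids(second_db, fetch_ids(first_db))
print(result)
```

Building the placeholder list rather than concatenating the ID values keeps the second query parameterized even though the list length varies.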

System.ArgumentException: Object is not an ADODB.RecordSet or an ADODB.Record

谁说胖子不能爱 posted on 2019-11-29 04:52:50
I used the code below to fill a data table - OleDbDataAdapter oleDA = new OleDbDataAdapter(); DataTable dt = new DataTable(); oleDA.Fill(dt, Dts.Variables["My_Result_Set"].Value); I get the error - Error: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.ArgumentException: Object is not an ADODB.RecordSet or an ADODB.Record. Parameter name: adodb at System.Data.OleDb.OleDbDataAdapter.FillFromADODB(Object data, Object adodb, String srcTable, Boolean multipleResults) at System.Data.OleDb.OleDbDataAdapter.Fill(DataTable dataTable,

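ETL essentials: regular expressions

我是研究僧i posted on 2019-11-29 00:39:30
Introduction to ETL ETL (Extraction-Transformation-Loading) is commonly glossed in Chinese as data cleansing (data extraction, transformation, and loading). Put simply, data is extracted from the data sources, cleaned, processed, and transformed, and then loaded into a predefined data warehouse model. The goal is to consolidate an enterprise's scattered, messy, and inconsistently standardized data so that it can serve as a basis for analysis in business decisions. ETL is an important part of a BI project: the quality of its design affects the quality of the data produced and bears directly on whether the BI project succeeds. There are many ways to carry out this processing, including cleansing tools, HQL's built-in functions, custom UDFs, and regular expressions; regular expressions are used very often and are very powerful. What regular expressions are and what they do A regular expression (often abbreviated regex, regexp, or RE) is a concept from computer science. Regular expressions are typically used to search for and replace text that matches a pattern (rule). Newcomers to regular expressions are usually baffled at first, but they are not as hard to understand as they seem. Take a regular expression like data(\w)?.dat searching the following names: here we can find all of them; if we change ? to +, only the last four match. For example: given a text with the following content, we need to find the word hello in it. Normally we would just press Ctrl+F and search, and that search is in effect the regular expression hello; when we only want the hello at the beginning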
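The difference between ? (zero or one) and + (one or more) can be checked with a small Python sketch. The excerpt does not preserve the original list of file names, so the names below are hypothetical, chosen so that the ? pattern matches all of them and the + pattern matches only the last four. The patterns are taken verbatim from the text, including the unescaped dot, which matches any character.

```python
import re

# Hypothetical file names; the original list is not preserved in the excerpt.
names = ["data.dat", "data1.dat", "data2.dat", "datax.dat", "dataN.dat"]


def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [name for name in candidates if compiled.fullmatch(name)]


# (\w)? allows zero or one word character after "data".
optional = matching(r"data(\w)?.dat", names)

# (\w)+ requires at least one word character, so "data.dat" drops out.
one_or_more = matching(r"data(\w)+.dat", names)

print(optional)
print(one_or_more)
```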

How to add a new Struct column to a DataFrame

可紊 posted on 2019-11-28 18:48:18
I'm currently trying to extract a database from MongoDB and use Spark to ingest it into ElasticSearch with geo_points. The Mongo database has latitude and longitude values, but ElasticSearch requires them to be cast into the geo_point type. Is there a way in Spark to copy the lat and lon columns to a new column that is an array or struct? Any help is appreciated! I assume you start with some kind of flat schema like this: root |-- lat: double (nullable = false) |-- long: double (nullable = false) |-- key: string (nullable = false) First let's create example data: import org.apache.spark.sql
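To make the target shape concrete, here is a plain-Python sketch (not Spark, and not the answer's code) that nests the flat lat and long fields of one record under a single struct-like field. The field name location and the function add_location are assumptions for illustration; the lat and lon keys follow the object form that Elasticsearch accepts for geo_point values.

```python
def add_location(record):
    """Return a copy of a flat record with lat/long nested under 'location'.

    'location' is a hypothetical field name; the nested keys 'lat' and 'lon'
    follow the object form accepted for an Elasticsearch geo_point.
    """
    enriched = dict(record)  # copy so the input record is left unchanged
    enriched["location"] = {"lat": record["lat"], "lon": record["long"]}
    return enriched


flat = {"lat": 1.0, "long": 2.0, "key": "foo"}
nested = add_location(flat)
print(nested)
```

In Spark the same reshaping would be done per column rather than per record, but the resulting document structure is the same.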

How to approach an ETL mission?

做~自己de王妃 posted on 2019-11-28 16:35:41
I am supposed to perform ETL where the source is a large, badly designed SQL 2k database and the target is a better-designed SQL 2k5 database. I think SSIS is the way to go. Can anyone suggest a to-do list or a checklist of things to watch out for so that I don't forget anything? How should I approach this so that it does not bite me in the rear later on? Well, I'm developing an ETL for the company where I work. We are working with SSIS, using the API to generate and build our own dtsx packages. SSIS is not friendly for managing errors. Sometimes you get an "OleDb Error" that could have a lot of different

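ETL subsystems

女生的网名这么多〃 posted on 2019-11-28 14:57:38
I have recently been reading "Pentaho Kettle Solutions" and found the part on ETL subsystems quite dense, so here are some brief notes. There are 34 ETL subsystems, divided into four groups: extraction, cleaning and conforming, delivery, and management. I. Extraction Subsystem 1: Data profiling system. The process of gathering statistics or other related information about the data in the different source systems, in order to analyze the structure and content of the different data sources. Subsystem 2: Change data capture system. Its purpose is to capture changes to the data in the system. Because of large data volumes and network latency, the data should not be reloaded in full once the initial load is complete; timestamps or snapshots are used to identify data that has changed or been updated. Subsystem 3: Extraction system. Extracts data from the different data sources and feeds it into the ETL flow. II. Cleaning and conforming Almost no data is free of problems, so steps to clean and correct the data have to be added before it is loaded into the data warehouse. In addition, each system stores data differently: for example, some data sources represent gender as 0 and 1, while others use "男" and "女" (male and female); data stored in the warehouse should follow one standard. Subsystem 4: Data cleaning and quality handling system. This step mainly corrects and tidies the dirty data entering the ETL flow, improving data quality. Subsystem 5: Error event handling. The purpose of error event handling is to record every error that occurs during ETL, so that administrators can monitor and analyze errors regularly. Subsystem 6: Audit dimension. An audit dimension table is a special kind of dimension table; every fact table in the data warehouse is linked to the audit dimension table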
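The timestamp approach mentioned for subsystem 2 can be sketched as follows. This is a minimal Python illustration, assuming each source row carries an updated_at timestamp and that the ETL job remembers the time of its last successful load; the function and field names are hypothetical and not taken from the book.

```python
from datetime import datetime


def extract_changes(rows, last_loaded_at):
    """Return rows changed since the last load, plus the new watermark.

    Assumes every row has an 'updated_at' datetime (hypothetical field name).
    """
    changed = [row for row in rows if row["updated_at"] > last_loaded_at]
    # The new watermark is the latest timestamp seen; keep the old one if nothing changed.
    new_watermark = max(
        (row["updated_at"] for row in changed), default=last_loaded_at
    )
    return changed, new_watermark


source_rows = [
    {"id": 1, "updated_at": datetime(2019, 11, 1, 8, 0)},
    {"id": 2, "updated_at": datetime(2019, 11, 20, 9, 30)},
    {"id": 3, "updated_at": datetime(2019, 11, 25, 17, 45)},
]

changed, watermark = extract_changes(source_rows, datetime(2019, 11, 15))
print([row["id"] for row in changed], watermark)
```

Storing the returned watermark and passing it to the next run means only rows modified after the previous load are extracted again.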

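Big data module development: ETL

∥☆過路亽.° posted on 2019-11-28 14:52:14
The essence of ETL work is to extract data from the various data sources, transform it, and finally load it into the tables of the dimensionally modeled data warehouse. The ETL work is complete only once these dimension/fact tables have been populated. In this project the data analysis runs on a Hadoop cluster, mainly using the Hive data warehouse tool, so the collected and preprocessed data needs to be loaded into the Hive data warehouse for the subsequent analysis. 1. Create the ODS-layer tables 1.1. Raw log table drop table if exists ods_weblog_origin; create table ods_weblog_origin( valid string, remote_addr string, remote_user string, time_local string, request string, status string, body_bytes_sent string, http_referer string, http_user_agent string) partitioned by (datestr string) row format delimited fields terminated by '\001'; 1.2. Clickstream model pageviews table drop table if exists ods_click_pageviews; create table ods_click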
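To show what a row for ods_weblog_origin looks like on disk, here is a small Python sketch that splits one preprocessed log line on the '\001' field delimiter declared in the table definition and pairs the values with the table's columns. The sample line and the parse_line helper are hypothetical; only the column names and the delimiter come from the DDL above.

```python
# Column order as declared in ods_weblog_origin (the partition column is not in the file).
COLUMNS = [
    "valid", "remote_addr", "remote_user", "time_local", "request",
    "status", "body_bytes_sent", "http_referer", "http_user_agent",
]
FIELD_SEP = "\x01"  # the '\001' delimiter from the table definition


def parse_line(line):
    """Split one delimited log line into a dict keyed by column name."""
    fields = line.rstrip("\n").split(FIELD_SEP)
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


# Hypothetical sample line built with the same delimiter.
sample = FIELD_SEP.join([
    "true", "192.0.2.10", "-", "2019-11-28 14:52:14", "/index.html",
    "200", "512", "-", "ExampleAgent/1.0",
]) + "\n"

record = parse_line(sample)
print(record["remote_addr"], record["status"])
```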

WildCards in SSIS Collection {not include} name xlsx

冷暖自知 posted on 2019-11-28 14:06:48
I have a process built in SSIS that loops through Excel files and imports data only from those whose names include Report. My user variable used as the expression is *Report*.xlsx, and it works perfectly fine. Now I am trying to build a similar loop, but only for files that do NOT include Report in the file name. Something like *<>Report*.xlsx. Is it possible? Thanks for help! Matt Unfortunately, you cannot achieve this using an SSIS expression (something like *[^...]*.xlsx); you have to look for workarounds: Workarounds First, get the list of filtered files using an Execute Script Task before entering
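The filtering logic behind such a workaround (keep .xlsx files whose names do not contain Report) can be sketched in Python. This is only an illustration of the filter, not the script used in the answer; the file names and the files_without helper are hypothetical, and names are lowercased on the assumption that matching should ignore case, as it does on Windows file systems.

```python
from fnmatch import fnmatchcase


def files_without(names, include="*.xlsx", exclude="*Report*.xlsx"):
    """Keep names that match `include` but not `exclude`, ignoring case."""
    kept = []
    for name in names:
        lowered = name.lower()
        if fnmatchcase(lowered, include.lower()) and not fnmatchcase(
            lowered, exclude.lower()
        ):
            kept.append(name)
    return kept


# Hypothetical folder listing.
candidates = ["Sales_Report_2019.xlsx", "Inventory.xlsx", "report.xlsx", "notes.txt"]
selected = files_without(candidates)
print(selected)
```

The resulting list could then be handed to a loop that processes one file at a time.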

how to check column structure in ssis?

Deadly posted on 2019-11-28 14:01:09
I have a table customer in my SQL Server. Columns: Distributer_Code Cust_code cust_name cust_add zip tel dl_number gstin. We receive customer files from the distributor on a monthly basis, and sometimes they send files with the wrong structure: maybe gstin is missing, or dl_number is missing, or gstin is in place of dl_number and dl_number is in place of tel. Basically, columns could be split. When we upload those flat files with SSIS it gives an error, and the data doesn't get uploaded to the server if the structure is wrong. I want to upload that data with nulls if columns are missing or
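One way to picture the check is to compare a file's header with the expected column list and fill absent columns with nulls before loading. The Python sketch below illustrates that idea only; it is not an SSIS solution, and it assumes comma-separated files with a header row, which the question does not state. The normalize_rows helper and the sample data are hypothetical.

```python
import csv
import io

# Expected columns of the customer table, as listed in the question.
EXPECTED = [
    "Distributer_Code", "Cust_code", "cust_name", "cust_add",
    "zip", "tel", "dl_number", "gstin",
]


def normalize_rows(text):
    """Return rows keyed by the expected columns, plus the list of missing columns.

    Assumes a comma-separated file with a header row (an assumption, not from the source).
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [column for column in EXPECTED if column not in header]
    # row.get() yields None for columns the file does not have, i.e. a null.
    rows = [{column: row.get(column) for column in EXPECTED} for row in reader]
    return rows, missing


# Hypothetical file that lacks dl_number and gstin.
sample = (
    "Distributer_Code,Cust_code,cust_name,cust_add,zip,tel\n"
    "D01,C100,Acme,1 Main St,12345,555-0100\n"
)

rows, missing = normalize_rows(sample)
print(missing, rows[0]["gstin"])
```

Detecting swapped columns, as opposed to missing ones, would need content checks on the values and is not covered by this sketch.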

SSIS How to get part of a string by separator

社会主义新天地 posted on 2019-11-28 13:19:05
I need an SSIS expression to get the left part of a string before the separator, and then put the new string in a new column. I checked the derived column, and there seems to be no such expression. Substring can only return a part with a fixed length. For example, with separator string - : Art-Reading should return Art, Art-Writing should return Art, Science-chemistry should return Science. P.S. I know this can be done in MySQL with SUBSTRING_INDEX(), but I'm looking for an equivalent in SSIS, or at least in SQL Server. Of course you can: just configure your derived columns like this: Here is the
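The expected behaviour (everything to the left of the first separator) is easy to state as a reference implementation. The Python sketch below is only a way to check the examples from the question, not the SSIS derived-column expression from the answer; the left_of helper is hypothetical.

```python
def left_of(value, separator="-"):
    """Return the part of `value` before the first separator.

    If the separator does not occur, the whole string is returned.
    """
    return value.partition(separator)[0]


examples = ["Art-Reading", "Art-Writing", "Science-chemistry"]
prefixes = [left_of(text) for text in examples]
print(prefixes)
```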