etl

OrientDB ETL Edge transformer 2 joinFieldName(s)

Anonymous (unverified), submitted on 2019-12-03 09:02:45
Question: With one joinFieldName and lookup, the Edge transformer works perfectly. However, two keys are now required, i.e. a compound index in the lookup. How can two joinFieldNames be specified? This is the scripted (post-processing) version: Create edge Expands from (select from MC where sample=1 and mkey=6) to (select from Event where sample=1 and mcl=6). This works, but is not suitable for production. Can anyone help? Answer 1: You can simply add 2 joinFieldName(s) like { "edge": { "class": "Conn", "joinFieldName": "b1", "lookup": "A.a1", "joinFieldName":

How do I integrate TFS Source Control with Business Intelligence Studio?

走远了吗., submitted on 2019-12-03 08:03:19
I am running Visual Studio 2010 Ultimate, which integrates with TFS source control. However, when I run SQL Server 2008 Business Intelligence Studio, no source control is offered. When I look under Tools > Options > Source Control, there are no plug-ins available. Is this because BI Studio uses the 2008 Visual Studio Shell and I only have VS 2010? TIA. I think you need to do the following: (1) install Visual Studio Team Explorer 2008; (2) install Visual Studio 2008 SP1; (3) install the Visual Studio 2008 SP1 Forward Compatibility Update for Team Foundation Server 2010 (KB974558). Here is a list of items

Open Source ETL framework [closed]

别来无恙, submitted on 2019-12-03 07:38:20
Question: As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 7 years ago. I was asked to prototype two ETL frameworks. The requirements are as follows: open source; available on Linux; maintained; logs can be

Generating seed code from existing database in ASP.NET MVC

蹲街弑〆低调, submitted on 2019-12-03 07:09:35
I wondered if anyone has encountered a similar challenge: I have a database with some data that was ETL'ed (imported and transformed) into it from an Excel file. In my ASP.NET MVC web application I'm using the Code First approach and dropping/recreating the database every time it changes: #if DEBUG Database.SetInitializer(new DropCreateDatabaseIfModelChanges<MyDataContext>()); #endif However, since the data in the database is lost, I have to ETL it again, which is annoying. Since the DB will be dropped only on a model change, I will have to tweak my ETL anyway; I know that. But I'd rather change my DB seed

How to create a SSIS package to ETL JSON from Python REST API request into MSSQL server?

Anonymous (unverified), submitted on 2019-12-03 01:39:01
Question: I am trying to take a Python script I wrote that makes GET requests against a REST API and returns data as JSON, and then have that data inserted into a SQL Server database that I will have to create. This job will need to run at least once each day. I am not familiar with creating tables in MSSQL, let alone creating an SSIS package or working with ETL. I would appreciate some direction on how to do this and on how realistic it is for somebody with little actual experience but a good conceptual understanding of the process itself. My
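The load step described in the question above can be sketched roughly as follows. This is only an illustration under stated assumptions: the payload, the field names in `COLUMNS`, and the table name `api_items` are hypothetical, and Python's built-in `sqlite3` module stands in for SQL Server so that the example is self-contained. Against SQL Server, the same DB-API pattern would go through a driver such as `pyodbc`, and the SQLite-specific `INSERT OR REPLACE` would need a T-SQL equivalent such as `MERGE`.

```python
import json
import sqlite3

# Hypothetical payload standing in for the JSON body returned by the GET request.
SAMPLE_RESPONSE = (
    '[{"id": 1, "name": "alpha", "value": 3.5},'
    ' {"id": 2, "name": "beta", "value": null}]'
)

# Assumed field names; the question does not say what the API returns.
COLUMNS = ("id", "name", "value")


def rows_from_json(payload):
    """Parse a JSON array of objects into tuples ordered by COLUMNS."""
    records = json.loads(payload)
    # dict.get returns None for missing keys, which the driver stores as NULL.
    return [tuple(record.get(col) for col in COLUMNS) for record in records]


def load_rows(conn, rows):
    """Create the target table if needed, upsert the rows, return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS api_items ("
        "id INTEGER PRIMARY KEY, name TEXT, value REAL)"
    )
    # INSERT OR REPLACE is SQLite syntax; it keeps a daily re-run from
    # duplicating rows that share a primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO api_items (id, name, value) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM api_items").fetchone()[0]


conn = sqlite3.connect(":memory:")  # stand-in for the SQL Server connection
rows = rows_from_json(SAMPLE_RESPONSE)
count = load_rows(conn, rows)
print(count)
```

Keeping parsing (`rows_from_json`) separate from loading (`load_rows`) means the daily job can be re-run safely and each half can be tested without the other.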

AWS Glue ETL job from AWS Redshift to S3 fails

Anonymous (unverified), submitted on 2019-12-03 01:38:01
Question: I am trying out the AWS Glue service to ETL some data from Redshift to S3. The crawler runs successfully and creates the meta table in the data catalog; however, when I run the ETL job (generated by AWS), it fails after around 20 minutes saying "Resource unavailable". I cannot see AWS Glue logs or error logs created in CloudWatch. When I try to view them it says "Log stream not found. The log stream jr_xxxxxxxxxx could not be found. Check if it was correctly created and retry." I would appreciate it if you could provide any guidance to resolve this

How to paralelize spark etl more w/out losing info (in file names)

Anonymous (unverified), submitted on 2019-12-03 01:33:01
Question: I'm going over a list of files on HDFS one by one, opening each as text and then saving it back to HDFS in another location. The data is parsed, then the part files are merged and saved under the same name as the original, with a BZIP2 suffix. However, it's rather slow: it takes ~3s per file, and I have over 10,000 of them per folder. I need to go file by file because I'm unsure how to keep the file name information. I need the name to be able to do an MD5 and "confirm" that no information loss has happened. Here's my code: import org.apache.hadoop.fs

maven deploy:deploy-file fails (409 Conflict), yet artifact uploads successfully

Anonymous (unverified), submitted on 2019-12-03 00:59:01
Question: NOTE: I now realize that the jar got placed into my repository, but the pom.xml did not. Now I have another project where the pom.xml fails to get promoted, but the jar is placed in the repository. However, in another project, both the pom.xml and the jar do get placed in the repository. I have a project in Jenkins where I use the promotion plugin to deploy my artifacts in Maven via the deploy:deploy-file goal. This works for several other projects I have in Maven, but it fails for this project. The funny thing is that the file (but not the

[Repost] ETL Explained (Very Detailed!)

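Anonymous (unverified), submitted on 2019-12-03 00:40:02
ETL is the process of extracting data from business systems, cleaning and transforming it, and loading it into a data warehouse. Its purpose is to consolidate an enterprise's scattered, disorderly, and inconsistently standardized data so that it can serve as a basis for analysis in business decision-making. ETL can be implemented in many ways, three of which are common. The first uses an ETL tool (such as Oracle's OWB, DTS in SQL Server 2000, SSIS in SQL Server 2005, or Informatica); the second uses SQL; the third combines an ETL tool with SQL. The first two each have advantages and disadvantages. A tool lets you set up an ETL project quickly, hides complex coding tasks, speeds up development, and lowers the difficulty, but it lacks flexibility. SQL is flexible and improves ETL runtime efficiency, but the coding is complex and demands more technical skill. The third approach combines the strengths of the first two and can greatly improve ETL development speed and efficiency.
I. Data extraction (Extract)
This part requires a great deal of work during the investigation phase. First you need to establish how many business systems the data comes from, which DBMS each system's database server runs, whether manually maintained data exists and how much of it there is, whether there is unstructured data, and so on. Only after collecting this information can you design the data extraction.
1. Handling data sources that use the same database system as the one hosting the DW
This kind of data source is relatively easy to design for. In general, the DBMS (SQL Server, Oracle) provides a database link feature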
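As a rough illustration of the extract, clean/transform, and load flow described in this excerpt, here is a minimal hand-coded sketch. Everything specific in it is an assumption made for the example: the `orders` source table, the `fact_orders` warehouse table, the cleaning rules, and the use of two in-memory `sqlite3` databases in place of a real business system and data warehouse.

```python
import sqlite3


def extract(source):
    """Extract: pull raw rows out of the (hypothetical) business system."""
    return source.execute("SELECT customer, amount FROM orders").fetchall()


def transform(rows):
    """Clean and transform: drop incomplete rows, normalize names and amounts."""
    cleaned = []
    for customer, amount in rows:
        if customer is None or amount is None:
            continue  # incomplete record, excluded from the warehouse
        cleaned.append((customer.strip().upper(), round(float(amount), 2)))
    return cleaned


def load(warehouse, rows):
    """Load: write the cleaned rows into the (hypothetical) warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders (customer TEXT, amount REAL)"
    )
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)", rows)
    warehouse.commit()


# Source system with deliberately messy data: stray spaces, mixed case,
# amounts stored as text, and missing values.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (customer TEXT, amount TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(" acme ", "10.50"), ("Acme", "4"), (None, "7"), ("globex", None)],
)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))

totals = warehouse.execute(
    "SELECT customer, SUM(amount) FROM fact_orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

This corresponds loosely to the hand-coded (SQL-style) approach: flexible, but every cleaning rule has to be written and maintained by hand, which is the trade-off the excerpt attributes to it.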

ETL Data Extraction Tools

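Anonymous (unverified), submitted on 2019-12-03 00:30:01
ETL is responsible for extracting data from distributed, heterogeneous sources (such as relational data and flat data files) into a temporary staging layer, where it is cleaned, transformed, and integrated, and finally loading it into a data warehouse or data mart, where it becomes the foundation for online analytical processing and data mining.
Evenly matched: Datastage and Powercenter
Datastage and Powercenter currently hold the vast majority of the domestic market and are comparable in cost. Other products exist, such as Data Integrator from Business Objects and DecisionStream from Cognos, but they remain sparks that have yet to grow into a prairie fire. When it comes to Datastage and Powercenter, if someone tells you that one is simply better than the other, be a little wary. There are two likely explanations: the speaker either works for one of the vendors, or is a developer with a lot of experience in one product and little in the other. Why draw this conclusion? The simple fact is that, judging from online discussions and arguments about them, each has its own strengths, and both have a considerable number of successful deployments and skilled implementers. Indeed, tools are inert; it is the people who make the difference.
A technical comparison of the two tools can consider support for ETL workflows, support for metadata, support for data quality, ease of maintenance, and support for custom development.
A single project may involve anywhere from a dozen or so to more than a hundred ETL processes between the data sources and the final target tables. The dependencies among these processes, error control, and recovery handling are all things a tool must address carefully. In this respect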