etl

Design a dimension with multiple data sources

Submitted by 我与影子孤独终老i on 2019-12-25 08:03:28
Question: I am designing a few dimensions with multiple data sources and wonder what other people have done to align the multiple business keys per data source. My example: I have 2 data sources - the Ordering System and the Execution System. The Ordering System has details about payment and what should happen; the Execution System has details on what actually happened (how long it took, who executed the order, etc.). Data from both systems is needed to create a single fact. In both the Ordering and
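A pattern I have seen used for this (a suggestion on my part, not something the question states) is a key map: the conformed dimension carries one surrogate key and keeps each source system's business key as an attribute, so facts from either system resolve to the same row. A minimal Python/pandas sketch with hypothetical column names:

    import pandas as pd

    # Ordering System rows (hypothetical columns and keys)
    orders = pd.DataFrame({
        "order_id": ["A-1", "A-2"],           # Ordering System business key
        "payment_type": ["card", "invoice"],
    })

    # Execution System rows (hypothetical columns), assumed to carry a reference back to the order
    executions = pd.DataFrame({
        "exec_id": ["X-9", "X-10"],           # Execution System business key
        "order_ref": ["A-1", "A-2"],
        "duration_min": [42, 17],
    })

    # One conformed dimension row per order, holding both business keys
    dim = orders.merge(executions, left_on="order_id", right_on="order_ref", how="outer")
    dim["order_sk"] = range(1, len(dim) + 1)  # surrogate key shared by facts from both systems

    print(dim[["order_sk", "order_id", "exec_id", "payment_type", "duration_min"]])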

The connection “Connection” cannot be found. Verify that the connection manager has a connection with that name

Submitted by 落花浮王杯 on 2019-12-25 03:40:35
Question: I'm using an OData connection manager in SSIS to connect to a private SharePoint Online site. I manage to test the connection when I create the connection manager and to preview my list when I'm in the designer. When I try to execute the package in VS I get the following error: The connection "Connection" cannot be found. Verify that the connection manager has a connection with that name. OData source failed validation and returned code 0xc020801A. I already created a new package and

Pentaho Kettle - Get the file names dynamically

Submitted by 感情迁移 on 2019-12-25 01:47:23
Question: I hope this message finds everyone well! I'm stuck on a situation in the Pentaho PDI tool and I'm looking for an answer (or at least a light at the end of the tunnel) to solve it! Every month I have to import a bunch of xls files from different clients. Every file has a different name (which is assigned randomly) and these files are in a folder named after the client. However, I use the same process for all clients and situations. Is there a way to pass the name of the directory as a
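In PDI itself this is normally handled with a job/transformation parameter plus a wildcard on the input step, but the underlying idea is simple enough to sketch in Python: the client folder name comes in as a parameter and the xls files inside it are discovered at run time (the paths and layout here are assumptions, not from the question):

    import glob
    import os
    import sys

    def list_client_files(base_dir, client):
        """Return every .xls file in the client's folder, whatever the files are named."""
        return sorted(glob.glob(os.path.join(base_dir, client, "*.xls")))

    if __name__ == "__main__":
        base_dir, client = sys.argv[1], sys.argv[2]   # e.g. /data/imports acme_corp
        for path in list_client_files(base_dir, client):
            print("would import:", path)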

ETL Tools that function well with ArangoDB - What are they?

Submitted by 折月煮酒 on 2019-12-25 01:33:50
Question: There are so many ETL tools out there, but not many that are free. And of the free choices, none appear to have any knowledge of or support for ArangoDB. If anyone has dealt with migrating their data over to ArangoDB and automated this process, I would love to hear how you accomplished it. Below I have listed several choices we have for ETL tools; I took these from the 2016 Spark Europe presentation by Bas Geerdink. * IBM InfoSphere DataStage * Oracle
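If none of those tools fits, one hand-rolled option (my suggestion, not something from the list) is the python-arango driver, which can bulk-insert whatever records the transform step produces. A sketch with placeholder connection details:

    from arango import ArangoClient  # pip install python-arango

    client = ArangoClient(hosts="http://localhost:8529")   # placeholder host
    db = client.db("etl_target", username="root", password="secret")

    if not db.has_collection("customers"):
        db.create_collection("customers")

    docs = [
        {"_key": "1001", "name": "Alice", "country": "NL"},
        {"_key": "1002", "name": "Bob", "country": "DE"},
    ]

    # insert_many sends the batch in one request; overwrite=True makes reruns idempotent.
    db.collection("customers").insert_many(docs, overwrite=True)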

How to deal with Linebreaks in redshift load?

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-25 01:29:44
Question: I have a CSV which has line breaks in one of the columns. I get the error Delimiter not found. If I rewrite the text as continuous, without line breaks, then it works. But how do I deal with the line breaks? My COPY command: COPY cat_crt_test_scores from 's3://rds-cat-crt-test-score-table/checkcsv.csv' iam_role 'arn:aws:iam::423639311527:role/RedshiftS3Access' explicit_ids delimiter '|' TIMEFORMAT 'auto' ESCAPE; The error is: Delimiter not found after reading till Dear Conduira, Source: https://stackoverflow.com
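With the ESCAPE option already in the COPY, Redshift accepts embedded newlines if they are preceded by a backslash in the load file, so one approach is to preprocess the file and escape them. A small Python sketch, assuming the source is a well-formed quoted CSV (file names are placeholders):

    import csv

    # Rewrite a quoted CSV (embedded newlines inside quoted fields) into the
    # pipe-delimited, backslash-escaped layout that COPY ... delimiter '|' ESCAPE expects.
    with open("checkcsv_raw.csv", newline="") as src, open("checkcsv.csv", "w", newline="") as dst:
        for row in csv.reader(src):
            escaped = [
                field.replace("\\", "\\\\").replace("|", "\\|").replace("\n", "\\\n")
                for field in row
            ]
            dst.write("|".join(escaped) + "\n")

Alternatively, if the multi-line fields are already quoted in the source file, switching the COPY to CSV mode (CSV QUOTE '"' together with DELIMITER '|', dropping ESCAPE) lets Redshift handle the embedded newlines itself.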

Simplest tool in AWS for very simple (transform in) ETL?

Submitted by 痞子三分冷 on 2019-12-25 01:11:36
Question: We have numerous files in S3 totaling tens of gigabytes. We need to get them into CSV format; currently the files have delimiters that are not commas. Normally I would do this on a server using sed, but I don't want to have to transfer the files to a server. I want to read directly from S3, translate to CSV line by line, and write the results back to new S3 files. Glue appears to be able to do this, but I sense the learning curve and setup for such a simple task is overkill. Is there not some
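For something this small, a short boto3 script (or the same code inside a Lambda function) that streams each object, swaps the delimiter, and writes the result to a new key may be all that is needed. A sketch, assuming pipe-delimited input and placeholder bucket/key names:

    import boto3

    s3 = boto3.client("s3")

    def to_csv(bucket, src_key, dst_key, delim="|"):
        """Stream an object from S3, replace the delimiter with commas, write it back."""
        body = s3.get_object(Bucket=bucket, Key=src_key)["Body"]
        # Assumes fields contain no commas or quotes of their own; otherwise use the csv module.
        lines = (line.decode("utf-8").replace(delim, ",") for line in body.iter_lines())
        s3.put_object(Bucket=bucket, Key=dst_key, Body="\n".join(lines).encode("utf-8"))

    to_csv("my-bucket", "raw/file1.txt", "csv/file1.csv")

For multi-gigabyte objects the final put_object would need to become a multipart upload, but the shape of the solution stays the same.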

Missing flowfile exception during NiFi processing causes loss of information

Submitted by 自闭症网瘾萝莉.ら on 2019-12-24 21:59:55
Question: During an ETL process, we had random exceptions that cause loss of flowfiles. NiFi is deployed on a 3-node Kubernetes cluster with repositories on a shared file system (GlusterFS). We did some stress tests, and of 2000 CSV files being processed almost 10% were lost with the reported exception. We also tried scaling down to one node and setting the number of parallel threads to 1 in order to minimize parallelism problems on the incriminated processors (validatecsv and validatejsonpath). It seems

How to extract and route only specified columns from a CSV file and drop all other columns [duplicate]

Submitted by ⅰ亾dé卋堺 on 2019-12-24 11:44:11
Question: This question already has answers here: How to extract a subset from a CSV file using NiFi (2 answers). Closed last year. I want to extract a few fields, along with their values, from a CSV file and drop/delete all other fields in the file. Please help. I think we can use the RouteText processor. Please tell me how to write the regular expression for routing only the specified fields and dropping everything else. Thanks. Example - from the snapshot attached I only want to route 'Firstname, Lastname and
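In NiFi the usual answer for column selection is a record-oriented processor such as QueryRecord (with a SELECT listing only the wanted columns) rather than a RouteText regex. Outside NiFi the same column projection is only a few lines; a Python sketch using the two column names mentioned in the question and a placeholder file name:

    import csv

    wanted = ["Firstname", "Lastname"]   # columns named in the question; extend as needed

    with open("input.csv", newline="") as src, open("subset.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=wanted, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)   # extra columns are dropped, only the wanted ones are written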

SSIS split string

Submitted by 蹲街弑〆低调 on 2019-12-24 10:58:42
Question: I have a dataset (log file) with a number of columns; one of them is Other-Data (see below), which is an unordered string, and I need to parse it to create new columns according to the u values (u1, u2, u3, etc.). OTHER-DATA u1=EUR;u2=sss:Checkout-Step4:Orderacknowledgement;u3=DE:de:hom;u11=1;u12=302338533;u13=SVE1511C5E;u14=575.67 Can anyone help with this? Answer 1: One way would be to add a script transformation component, use the OTHER-DATA as your input row, parse it with C# or VB.Net, and output it
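The answer points at a script transformation written in C# or VB.NET; the parsing itself is just a split on ';' and then on the first '='. The same logic as a Python sketch (parsing only, not the SSIS wiring):

    other_data = ("u1=EUR;u2=sss:Checkout-Step4:Orderacknowledgement;"
                  "u3=DE:de:hom;u11=1;u12=302338533;u13=SVE1511C5E;u14=575.67")

    # Split into key=value pairs, splitting on the first '=' only so values
    # containing ':' (or even '=') stay intact.
    pairs = (item.split("=", 1) for item in other_data.split(";") if item)
    parsed = {key.strip().lower(): value for key, value in pairs}

    print(parsed["u1"])    # EUR
    print(parsed["u14"])   # 575.67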

Best ETL Packages In Python

Submitted by 孤者浪人 on 2019-12-24 08:00:03
Question: I have 2 use cases: (1) extract, transform and load from Oracle / PostgreSQL / Redshift / S3 / CSV to my own Redshift cluster; (2) schedule the job so it runs daily/weekly (INSERT + TABLE or INSERT + NONE options preferable). I am currently using: SQLAlchemy for extracts (works well generally); PETL for transforms and loads (works well on smaller data sets, but for ~50m+ rows it is slow and the connection to the database(s) times out); an internal tool for the scheduling component (which stores the
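One workaround for the ~50m-row bottleneck (my suggestion, not part of the question) is to stream the extract in chunks via SQLAlchemy/pandas and load Redshift through S3 plus COPY rather than row-by-row inserts. A sketch of the chunked extract side, with placeholder connection details and table name:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:pass@host:5432/sourcedb")  # placeholder

    first = True
    for chunk in pd.read_sql_query("SELECT * FROM big_table", engine, chunksize=100_000):
        # Append each chunk to a local file that is later uploaded to S3 and loaded with COPY.
        chunk.to_csv("big_table.csv", mode="w" if first else "a", header=first, index=False)
        first = False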