etl

Recurrent machine learning ETL using Luigi

Submitted by 半腔热情 on 2019-12-11 06:58:34
Question: Today, the machine learning job I've written is run by hand. I download the needed input files, run training and prediction, output a .csv file, and then copy it into a database. Since this is going into production, however, I need to automate the whole process. The input files will arrive every month (and eventually more frequently) in an S3 bucket from the provider. I'm now planning to use Luigi to solve this problem. Here is the ideal process: Every week (or day, or hour, …
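As an illustration of the direction the question is heading, here is a minimal Luigi sketch of a two-step pipeline (download, then predict). All task, path, and date names are hypothetical, and the S3 download is stubbed out; a real task would use luigi.contrib.s3 or boto3.

```python
import datetime

import luigi


class DownloadInput(luigi.Task):
    """Fetch one month's input file from the provider's S3 bucket."""
    date = luigi.MonthParameter()

    def output(self):
        return luigi.LocalTarget(f"data/input-{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("stub\n")  # placeholder for the real S3 download


class Predict(luigi.Task):
    """Run the trained model over the input and write a predictions CSV."""
    date = luigi.MonthParameter()

    def requires(self):
        return DownloadInput(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"data/predictions-{self.date}.csv")

    def run(self):
        with self.input().open() as fin, self.output().open("w") as fout:
            for line in fin:  # placeholder for load-model-and-predict
                fout.write(line)


if __name__ == "__main__":
    luigi.build([Predict(date=datetime.date(2019, 12, 1))],
                local_scheduler=True)
```

Because each task's output file doubles as its completion marker, a cron entry (or the central Luigi scheduler) can simply rerun the whole graph on a schedule and only the missing pieces execute.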

Is there a way to save all queries present in an SSIS package/.dtsx file?

Submitted by 五迷三道 on 2019-12-11 06:41:43
Question: I need to run some analysis on my queries (specifically, finding all the tables an SSIS package calls). Right now I'm opening every single SSIS package and every single step in it, and manually copying and pasting the tables out of it. As you can imagine, it's very time-consuming and mind-numbing. Is there a way to export all the queries automatically? By the way, I'm using SQL Server 2012. Answer 1: Retrieving the queries is not a simple process; you can work in two ways to achieve it: analyzing the .dtsx package XML …
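Since a .dtsx file is plain XML, one hedged way to automate the extraction is to scan each package for the properties that conventionally hold SQL text. The property names below (SqlCommand, SqlStatementSource, and so on) cover common components but are not exhaustive, and the packages/ path is a placeholder.

```python
import glob
import xml.etree.ElementTree as ET

SQL_PROPERTY_NAMES = {"SqlCommand", "SqlCommandVariable", "SqlStatementSource"}


def extract_queries(dtsx_path):
    queries = []
    for elem in ET.parse(dtsx_path).iter():
        # Data-flow components keep SQL in <property name="SqlCommand">...</property>
        if elem.tag.endswith("property") and elem.get("name") in SQL_PROPERTY_NAMES:
            if elem.text and elem.text.strip():
                queries.append(elem.text.strip())
        # Execute SQL Tasks keep the statement in a namespaced attribute instead
        for attr, value in elem.attrib.items():
            if attr.split("}")[-1] in SQL_PROPERTY_NAMES and value.strip():
                queries.append(value.strip())
    return queries


for path in glob.glob("packages/**/*.dtsx", recursive=True):
    for query in extract_queries(path):
        print(path, "->", query[:80])
```

From there the collected statements can be fed through a SQL parser or simple regexes to pull out the table names.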

Talend performance

Submitted by 我怕爱的太早我们不能终老 on 2019-12-11 06:28:28
Question: We have a requirement to read data from three different files and join them on different columns within the same job. Each file is around 25–30 GB, but our system has only 16 GB of RAM. The joins are done with tMap, and Talend keeps all of the reference data in physical memory, which in my case I cannot provide, so the job fails with an out-of-memory error. If I use tMap's join with the temp-disk option instead, the job is dead slow. Please help me with these questions: How does Talend …
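For intuition about the trade-off behind tMap's disk option, here is a Python illustration (not Talend's actual mechanism): the reference file goes into an indexed on-disk store, memory stays flat, and each main-file row pays for a disk lookup instead, which is why a naive temp-disk join is slow. File names and the two-column layout are assumptions.

```python
import csv
import sqlite3

con = sqlite3.connect("join_scratch.db")
con.execute("CREATE TABLE IF NOT EXISTS ref (key TEXT PRIMARY KEY, payload TEXT)")

with open("reference.csv", newline="") as f:
    rows = csv.reader(f)
    next(rows)  # skip header; assumes exactly two columns: key, payload
    con.executemany("INSERT OR REPLACE INTO ref VALUES (?, ?)", rows)
con.commit()

with open("main.csv", newline="") as f, \
        open("joined.csv", "w", newline="") as out:
    reader, writer = csv.reader(f), csv.writer(out)
    next(reader)
    for key, value in reader:  # streams row by row, so RAM use stays constant
        match = con.execute("SELECT payload FROM ref WHERE key = ?",
                            (key,)).fetchone()
        if match:
            writer.writerow([key, value, match[0]])
con.close()
```

Loading both sides into the store and letting the database perform a single set-based join usually recovers most of the lost speed, which is roughly the tuning question being asked about Talend.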

SSIS problems with MySQL Connector/ODBC 5.3.8

Submitted by 自古美人都是妖i on 2019-12-11 06:22:33
Question: I am having a problem with an SSIS project that downloads data from a MySQL database and inserts it into a SQL Server 2014 database. I have two versions of the same project, one for SQL Server 2016 and another for SQL Server 2014. They have the same scripts and data flows, but for some reason only the one made for SQL Server 2016 works. The issue resides in the ODBC driver connector: I can preview data in both projects, but the SQL Server 2014 version simply won't load it. So I …

Getting “External table is not in the expected format.” error while trying to import an Excel File in SSIS

Submitted by 谁说胖子不能爱 on 2019-12-11 06:15:54
Question: I am trying to import an Excel file (.xls) via SSIS into a table in SQL Server, but SSIS doesn't seem to recognize the file as a valid Excel file. I get the following errors: Error 1: [Excel Source [86]] Error: SSIS Error Code DTS_E_CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER. The AcquireConnection method call to the connection manager "Carga Base Original" failed with error code 0xC0202009. There may be error messages posted before this with more information on why the AcquireConnection …
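One frequent cause of "External table is not in the expected format" is an .xls file that is not actually a legacy Excel workbook, for example an HTML export or an .xlsx renamed to .xls. A quick sniff of the file's leading bytes, sketched below with a placeholder path, can confirm what the file really is; the signature constants are standard.

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls container
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx (zip container)


def sniff(path):
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "genuine legacy .xls (OLE2 compound file)"
    if head.startswith(ZIP_MAGIC):
        return ".xlsx in disguise -- needs the ACE 12.0+ provider"
    if head.lstrip()[:1] == b"<":
        return "HTML/XML pretending to be .xls"
    return "unknown format"


print(sniff(r"C:\data\carga_base_original.xls"))
```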

Dynamically filter XML child elements with XSLT in SSIS

Submitted by 荒凉一梦 on 2019-12-11 05:26:24
Question: I have a big XML file from ICECAT and I want to extract only some of the information. This follows on from the question "how transform xml document with xslt to duplicate output lines according to the child node". I have a table of languages in my database, and I want to use XSLT to filter the <Name> child elements according to that table's contents. I'm working in an SSIS project. Answer 1: 1. I create a variable named Filter and a Foreach ADO Enumerator that puts the enumeration into the variable IdLang. 2. Use an Expression Task with this expression: @ …
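To make the XSLT side concrete outside SSIS, here is a sketch using lxml: an identity transform plus one template that keeps a <Name> element only when its langid appears in a comma-delimited parameter. The langid attribute and the file name are assumptions about the ICECAT feed.

```python
from lxml import etree

STYLESHEET = b"""\
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="langs" select="','"/>
  <!-- identity: copy every node through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- keep a <Name> only if ,langid, occurs in the $langs list -->
  <xsl:template match="Name">
    <xsl:if test="contains($langs, concat(',', @langid, ','))">
      <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
"""

transform = etree.XSLT(etree.XML(STYLESHEET))
doc = etree.parse("icecat_product.xml")  # placeholder file name
result = transform(doc, langs=etree.XSLT.strparam(",1,2,"))  # keep langids 1, 2
print(str(result))
```

In the SSIS flow the answer describes, the $langs string would be assembled from the language table and injected via the Filter variable by the Expression Task.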

Transform a folder of CSV files the same way, then output multiple dataframes with Python

Submitted by 岁酱吖の on 2019-12-11 04:13:45
Question: I've got a folder of CSV files that I need to transform and clean up, outputting a dataframe that I can then continue working with. I'd like one uniquely titled dataframe per CSV file. I wrote code that manipulates just one of the CSV files the way I'd like, ending with a clean dataframe, but I'm getting tripped up trying to iterate through the folder and transform all of the CSV files, ending up with one dataframe per CSV. Here's the code I've …
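A common shape for this, sketched below with placeholder paths and a stand-in cleaning function, is to keep the per-file logic in one function and collect the results in a dict keyed by file name, rather than trying to mint uniquely named variables dynamically.

```python
from pathlib import Path

import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the per-file transformation the question describes."""
    df = df.dropna(how="all")
    df.columns = [c.strip().lower() for c in df.columns]
    return df


frames = {
    path.stem: clean(pd.read_csv(path))
    for path in sorted(Path("data/csvs").glob("*.csv"))
}

# One cleaned frame per CSV, addressable by its file name (sans extension):
# frames["january_sales"].head()
```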

BigQuery to Hadoop Cluster - How to transfer data?

Submitted by 一曲冷凌霜 on 2019-12-11 04:07:42
Question: I have a Google Analytics (GA) account that tracks the user activity of an app. I have BigQuery set up so that I can access the raw GA data, which comes in from GA to BigQuery daily. I have a Python app that queries the BigQuery API programmatically, and it gives me the required response for whatever I query. My next step is to get this data out of BigQuery and dump it into a Hadoop cluster; ideally I would like to create a Hive table from the data. I would …
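One common route (an assumption here, not the only option) is to export the BigQuery table to Google Cloud Storage and then copy the files into HDFS, where an external Hive table can sit on top of them. The sketch below uses the google-cloud-bigquery client; project, dataset, table, and bucket names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

extract_job = client.extract_table(
    "my-project.ga_export.sessions",             # source table
    "gs://my-bucket/ga-export/sessions-*.json",  # sharded output files
    job_config=bigquery.ExtractJobConfig(
        destination_format="NEWLINE_DELIMITED_JSON"  # Hive-friendly format
    ),
)
extract_job.result()  # block until the export finishes

# Then, on the cluster, something like:
#   hadoop distcp gs://my-bucket/ga-export/ hdfs:///data/ga/
# (requires the GCS connector for Hadoop), and define an external Hive
# table over the landed files.
```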

SSIS output JSON file adding extra CRLF

Submitted by 不打扰是莪最后的温柔 on 2019-12-11 03:58:27
Question: This has been solved with C# code; please refer to this post. I have a package with an OLE DB source that runs a query to generate a JSON file for a flat file destination. The query (note that I used REPLACE to strip any CR/LF characters that might exist in the activity column, although I don't see any): SELECT Replace(Replace([activity], char(13),''), char(10),'') [activity] FROM [CRM].[dbo].[JJVCACUProductElectron] for json auto. The generated JSON still contains some CRLFs, so multiple lines are visible if it is viewed in …
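The post was solved with C#, but as an illustration of the likely cause: SQL Server returns FOR JSON results split across multiple short rows, and a flat file destination writes its row delimiter after each one, injecting CRLFs that were never in the data. A hedged Python sketch that re-joins the physical lines into one document (file names are placeholders):

```python
import json

with open("export_raw.json", encoding="utf-8") as f:
    text = "".join(line.rstrip("\r\n") for line in f)

json.loads(text)  # raises ValueError if the re-joined document is malformed

with open("export_clean.json", "w", encoding="utf-8") as f:
    f.write(text)
```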

Batch convert Visual FoxPro DBF tables to CSV

Submitted by 不打扰是莪最后的温柔 on 2019-12-11 03:49:03
Question: I have a huge collection of Visual FoxPro .dbf files that I would like to convert to CSV. (If you like, you can download some of the data here. Click on the 2011 link for Transaction Data, and prepare to wait a long time...) I can open each table with DBF View Plus (an awesome freeware utility), but exporting them to CSV takes a few hours per file, and I have several dozen files to work with. Is there a program like DBF View Plus that will let me set up a batch of dbf-to-csv conversions …
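One scriptable alternative, sketched under the assumption that the dbfread Python library (pip install dbfread) can read these particular Visual FoxPro tables: convert every .dbf in a folder to a like-named .csv. The folder names and the cp1252 encoding are placeholders to adjust.

```python
import csv
from pathlib import Path

from dbfread import DBF

src = Path("dbf_tables")
dst = Path("csv_out")
dst.mkdir(exist_ok=True)

for dbf_path in sorted(src.glob("*.dbf")):
    table = DBF(str(dbf_path), encoding="cp1252")
    with open(dst / (dbf_path.stem + ".csv"), "w", newline="",
              encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(table.field_names)
        for record in table:  # records stream one at a time, so RAM stays low
            writer.writerow(list(record.values()))
    print("converted:", dbf_path.name)
```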