ETL

What's the most efficient way to convert a MySQL result set to a NumPy array?

Submitted by 倖福魔咒の on 2019-11-27 13:41:36
I'm using MySQLdb and Python. I have some basic queries such as this: c = db.cursor(); c.execute("SELECT id, rating FROM video"); results = c.fetchall(). I need "results" to be a NumPy array, and I'm looking to be economical with my memory consumption. It seems like copying the data row by row would be incredibly inefficient (double the memory would be required). Is there a better way to convert MySQLdb query results into NumPy array format? The reason I want the NumPy array format is that I want to be able to slice and dice the data easily, and it doesn't seem like Python is…
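Assuming results is the tuple of row tuples that fetchall() returns, one memory-lean sketch is to stream the values through numpy.fromiter, so only a single NumPy buffer is allocated rather than a per-row copy (the sample rows here are made up, standing in for a live cursor):

```python
import numpy as np

# Simulated fetchall() output: MySQLdb returns a tuple of row tuples.
results = ((1, 4.5), (2, 3.0), (3, 5.0))

# Flatten the rows lazily into one contiguous float buffer; passing
# count lets NumPy allocate the exact size up front, avoiding resizes.
flat = np.fromiter(
    (value for row in results for value in row),
    dtype=np.float64,
    count=len(results) * 2,
)

# Reshape is a view, not a copy: two columns -> (id, rating).
arr = flat.reshape(-1, 2)
print(arr.shape)
```

The trade-off is that a single dtype must fit every column (here both id and rating become float64); if the columns need distinct types, a structured dtype is the usual alternative.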

How to add a new Struct column to a DataFrame

Submitted by 妖精的绣舞 on 2019-11-27 11:40:41
Question: I'm currently trying to extract a database from MongoDB and use Spark to ingest it into ElasticSearch with geo_points. The Mongo database has latitude and longitude values, but ElasticSearch requires them to be cast into the geo_point type. Is there a way in Spark to copy the lat and lon columns to a new column that is an array or struct? Any help is appreciated! Answer 1: I assume you start with some kind of flat schema like this: root |-- lat: double (nullable = false) |-- long: double…
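As a sketch of the direction the answer is heading: Spark SQL's named_struct function can pack the two columns into a struct in one pass via selectExpr. The helper below only builds the expression string, so it runs without a Spark session; the column and output names are assumptions matching the question:

```python
def geo_struct_expr(lat_col: str, lon_col: str, out_col: str = "location") -> str:
    # Build a Spark SQL expression that packs lat/lon into a struct whose
    # field names ('lat', 'lon') match what ElasticSearch expects for a
    # geo_point mapping.
    return f"named_struct('lat', {lat_col}, 'lon', {lon_col}) AS {out_col}"

# Usage against a DataFrame df (assumed to exist in a live Spark session):
#   df2 = df.selectExpr("*", geo_struct_expr("lat", "long"))
print(geo_struct_expr("lat", "long"))
```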

How to approach an ETL mission?

Submitted by 柔情痞子 on 2019-11-27 09:48:01
Question: I am supposed to perform ETL where the source is a large and badly designed SQL 2000 database and a better designed SQL 2005 database. I think SSIS is the way to go. Can anyone suggest a to-do list or a checklist of things to watch out for so that I don't forget anything? How should I approach this so that it does not bite me in the rear later on? Answer 1: Well, I'm developing an ETL for the company where I am. We are working with SSIS, using the API to generate and build our own dtsx packages. SSIS it's…

Why does my ODBC connection fail when running an SSIS load in Visual Studio but not when running the same package using Execute Package Utility

Submitted by 喜你入骨 on 2019-11-27 08:55:03
I'm working on a Data Mart loading package in SSIS 2012. When attempting to execute the package in Visual Studio I get this error: "The AcquireConnection method call to the connection manager Data Warehouse.ssusr failed with error code 0xC0014009". When I test the connectivity of the Connection Manager Data Warehouse.ssusr I see that it passes. When I execute the package outside of Visual Studio using the Execute Package Utility, the package runs. I don't understand what's going on. The package also refuses to run using the SQL Server Job Schedule, if that has anything to do with anything.

ETL 开发笔记

Submitted by 橙三吉。 on 2019-11-27 08:31:13
1. Oracle's NVL function (returns a non-null value from one of two expressions). Syntax: NVL(eExpression1, eExpression2). Parameters: eExpression1, eExpression2. If eExpression1 evaluates to null, NVL() returns eExpression2; if eExpression1 does not evaluate to null, eExpression1 is returned. eExpression1 and eExpression2 may be of any data type. If both eExpression1 and eExpression2 evaluate to null, NVL() returns .NULL.. Return type: character, date, datetime, numeric, currency, logical, or null. NVL() can be used to remove null values from a calculation or operation in cases where nulls are not supported or are irrelevant. Example: select nvl(a.name,'空得') as name from student a join school b on a.ID=b.ID. Note: the types of the two parameters must match. For example: SELECT T.D_FDATE, T.VC_ZHCODE, NVL(SUM(T.F_FZQSZ), 0) f_price_b, NVL(SUM(T.F_FZQCB), 0) f…

Getting top n to n rows from db2

Submitted by 心已入冬 on 2019-11-27 08:12:06
Question: I need to split a huge table into chunks, fetching data from DB2 and processing it in SSIS. Iteration 1: get the first 10 rows and process them. Iteration 2: get the next 10 rows (11-20) and process them. Iteration 3: get the next 10 rows (21-30) and process them, and so on up to count(*) of the table. Is it possible to get rows n to m from DB2? I'm looking for a query like the one below: select * from tablename fetch 10 to 20 rows. Answer 1: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref…
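On DB2 LUW 11.1 (the version the linked Knowledge Center page documents), the standard way to get rows n to m is the OFFSET … FETCH FIRST clause. A sketch of a query generator for the iterations above, suitable for driving an SSIS loop variable (table and order column are made-up names; an ORDER BY is needed for stable, repeatable paging):

```python
def chunk_query(table: str, order_col: str, offset: int, size: int) -> str:
    # ORDER BY guarantees each iteration sees a consistent row order;
    # without it, DB2 may return rows in any order and chunks can overlap.
    return (
        f"SELECT * FROM {table} "
        f"ORDER BY {order_col} "
        f"OFFSET {offset} ROWS FETCH FIRST {size} ROWS ONLY"
    )

# Iteration 2 from the question (rows 11-20):
print(chunk_query("video", "id", 10, 10))
```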

How to check column structure in SSIS?

Submitted by ﹥>﹥吖頭↗ on 2019-11-27 08:11:10
Question: I have a table customer in my SQL Server with these columns: Distributer_Code, Cust_code, cust_name, cust_add, zip, tel, dl_number, gstin. We receive customer files from the distributors on a monthly basis, and sometimes they send files with the wrong structure: maybe gstin is missing, or dl_number is missing, or gstin is in place of dl_number and dl_number is in place of tel. Basically, columns could be shifted. When we upload those flat files with SSIS it gives an error and the data doesn't get uploaded on the…
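One way to catch such files before the SSIS data flow touches them is a pre-validation step (e.g. a Script Task) that compares the flat file's header row against the expected column list. The sketch below uses Python's csv module with in-memory sample data; column names come from the question:

```python
import csv
import io

EXPECTED = ["Distributer_Code", "Cust_code", "cust_name", "cust_add",
            "zip", "tel", "dl_number", "gstin"]

def header_ok(flat_file_text: str, expected=EXPECTED) -> bool:
    # Read only the first line and compare column names in order,
    # so a shifted or missing column rejects the file up front.
    reader = csv.reader(io.StringIO(flat_file_text))
    header = next(reader, [])
    return [h.strip() for h in header] == expected

good = "Distributer_Code,Cust_code,cust_name,cust_add,zip,tel,dl_number,gstin\n1,C1,A,Addr,560001,999,DL1,GST1\n"
bad = "Distributer_Code,Cust_code,cust_name,cust_add,zip,dl_number,tel,gstin\n"
print(header_ok(good), header_ok(bad))
```

Files that fail the check can be moved to a quarantine folder instead of entering the data flow, which avoids the mid-load SSIS errors described above.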

How to execute scheduled SQL script on Amazon Redshift?

Submitted by 混江龙づ霸主 on 2019-11-27 07:06:45
Question: I have a series of ~10 queries to be executed every hour automatically in Redshift (ideally also reporting success/failure). Most queries are aggregations on my tables. I have tried using AWS Lambda with CloudWatch Events, but Lambda functions only survive for 5 minutes max and my queries can take up to 25 minutes. Answer 1: It's kind of strange that AWS doesn't provide a simple distributed cron-style service; it would be useful for so many things. There is SWF, but the timing/scheduling aspect is left up to…
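A common workaround for the Lambda time limit is to run the script from a plain cron job on an always-on host (e.g. a small EC2 instance), where a 25-minute query is no problem. The sketch below shows only the statement-splitting part, which is runnable as-is; the crontab line and the psycopg2 execution against Redshift are assumptions, not exercised here:

```python
def split_statements(sql_script: str):
    # Naive split on ';' -- adequate for simple aggregation queries
    # that contain no string literals with embedded semicolons.
    return [s.strip() for s in sql_script.split(";") if s.strip()]

# On the always-on host, a crontab entry such as
#   0 * * * * python run_report.py
# would loop over split_statements(open("report.sql").read()), execute
# each statement via a psycopg2 connection to the Redshift endpoint,
# and log success/failure per statement.
stmts = split_statements("INSERT INTO agg1 SELECT ...; INSERT INTO agg2 SELECT ...;")
print(len(stmts))
```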

Reverse engineering SSIS package using C#

Submitted by 断了今生、忘了曾经 on 2019-11-27 06:11:20
Question: There is a requirement to extract the source, destination, and column names of source and destination. The reason I'm trying to do this is that I have thousands of packages, each with 60 to 75 columns on average, so listing all the required info manually takes a huge amount of time. It's not a one-time requirement either: this task is currently done manually every two months in my organization. I'm looking for some way to reverse engineer it, keeping all the packages in a single folder and…
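Since a .dtsx package is plain XML in the DTS namespace, a first pass at reverse engineering can be done with a standard XML parser rather than the full SSIS object model. The sketch below pulls connection manager names from a made-up package fragment; real packages nest source/destination column metadata deeper, but the same findall approach applies:

```python
import xml.etree.ElementTree as ET

# Minimal, made-up fragment of a .dtsx file; a real package would be
# loaded with ET.parse(path) while walking the folder with os.walk.
SAMPLE = """<Executable xmlns:DTS="www.microsoft.com/SqlServer/Dts">
  <DTS:ConnectionManager DTS:ObjectName="SourceDB"/>
  <DTS:ConnectionManager DTS:ObjectName="DestDB"/>
</Executable>"""

NS = {"DTS": "www.microsoft.com/SqlServer/Dts"}

def connection_names(dtsx_xml: str):
    # Collect the DTS:ObjectName attribute of every connection manager.
    root = ET.fromstring(dtsx_xml)
    return [cm.get("{www.microsoft.com/SqlServer/Dts}ObjectName")
            for cm in root.findall(".//DTS:ConnectionManager", NS)]

print(connection_names(SAMPLE))
```

Writing the collected names out as CSV per package then turns the bimonthly manual task into a single script run over the folder.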

Automate process by running excel VBA macro in SSIS

Submitted by Deadly on 2019-11-27 05:51:28
Question: Recently, I have a project that needs to automate a process by combining an SSIS package and Excel VBA macros into one. Below are the steps: I have an SSIS package exporting every view's result from SQL Server to its own individual Excel file; all the files are saved in the same location. I have one Excel VBA macro that performs cleaning to remove all the empty sheets in each exported Excel file. I also have an Excel VBA macro that performs a merging task to merge all the Excel files into one master Excel file. This…