etl

UPSERT in parquet Pyspark

假如想象 submitted on 2020-07-19 01:59:52
Question: I have parquet files in S3 with the following partitions: year / month / date / some_id. Using Spark (PySpark), each day I would like to UPSERT the last 14 days: replace the existing data in S3 (one parquet file for each partition) without deleting the days older than 14 days. I tried two save modes: append wasn't good because it just adds another file; overwrite deletes the past data and the data for other partitions. Is there any way or best practice to

UPSERT in parquet Pyspark

不羁的心 submitted on 2020-07-19 01:58:45
Question: I have parquet files in S3 with the following partitions: year / month / date / some_id. Using Spark (PySpark), each day I would like to UPSERT the last 14 days: replace the existing data in S3 (one parquet file for each partition) without deleting the days older than 14 days. I tried two save modes: append wasn't good because it just adds another file; overwrite deletes the past data and the data for other partitions. Is there any way or best practice to
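One commonly used approach to the partition-replacement problem described above is Spark's dynamic partition overwrite mode, sketched below. This is a minimal illustration, not the accepted answer to the question: the helper names `recent_dates` and `overwrite_recent_partitions` are hypothetical, `spark` is assumed to be an existing SparkSession, and `df` is assumed to already contain only the rows for the days being replaced.

```python
from datetime import date, timedelta


def recent_dates(today, days=14):
    """Return the last `days` calendar dates, oldest first, ending at `today`."""
    return [today - timedelta(days=offset) for offset in range(days - 1, -1, -1)]


def overwrite_recent_partitions(spark, df, base_path):
    """Overwrite only the partitions present in `df`; other partitions stay.

    `spark` (a SparkSession) and `df` (a DataFrame holding only the days to
    replace) are assumptions supplied by the caller.
    """
    # Dynamic mode (Spark 2.3+) limits an overwrite to the partitions that
    # actually appear in the DataFrame being written.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    (
        df.write.mode("overwrite")
        .partitionBy("year", "month", "date", "some_id")
        .parquet(base_path)
    )


# The 14-day window ending on the day the question was posted.
window = recent_dates(date(2020, 7, 19))
```

The `window` list can be used to filter the source data down to the days being rewritten before calling `overwrite_recent_partitions`; partitions outside that window are not touched by the write.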

SSIS File System Task Error while copying files between servers

无人久伴 submitted on 2020-07-08 04:16:18
Question: I can copy files manually between two servers, say Server A and Server B, and I have permissions on the folders on both sides. I am using a File System Task to copy files. When my source and destination are on the same server, the package works fine in Visual Studio as well as in SSISDB. When my source and destination are on different servers, the package works fine in Visual Studio but fails in SSISDB, saying access is denied. My account is mapped to SSISDB. Any idea how to solve this issue?

Azure data factory: Handling inner failure in until/for activity

可紊 submitted on 2020-06-29 03:48:14
Question: I have an Azure Data Factory v2 pipeline containing an Until activity. Inside the Until is a Copy activity. If this fails, the error is logged, exactly as in this post, and I want the loop to continue: Azure Data Factory Pipeline 'On Failure'. Although the inner Copy activity's error is handled, the Until activity is deemed to have failed because an inner activity has failed. Is there any way to configure the Until activity to continue when an inner activity fails? Answer 1: Solution Put the error

Import/Export DataFusion pipelines

帅比萌擦擦* submitted on 2020-06-26 04:07:16
Question: Does anyone know if it is possible to programmatically import/export DataFlow pipelines (deployed or in draft status)? The idea is to write a script to drop and create a DataFusion instance, in order to avoid billing when it's not used. Via the gcloud command line it's possible to provision a DataFusion cluster and to destroy it, but it would be interesting to automatically export and import all my pipelines too. The official documentation, unfortunately, didn't help me... Thanks! Answer 1: You could

ETL 1.5 GB Dataframe within pyspark on AWS EMR

孤人 submitted on 2020-06-01 07:39:21
Question: I'm using an EMR cluster with 1 master (m5.2xlarge) and 4 core nodes (c5.2xlarge), and running a PySpark job on it that joins 5 fact tables (150 columns and 100k rows each) and 5 small dimension tables (10 columns each, with fewer than 100 records). When I join all these tables, the resulting dataframe has 600 columns and 420k records (approximately 1.5 GB of data). Please suggest something here; I'm from a SQL and DWH background. Hence I have used a single SQL query to join all 5 facts

How to provide input to jolt when multiple objects are given

北慕城南 submitted on 2020-05-17 07:04:13
Question: How do I apply a transformation to a JSON file that has records in the following format (not an array, just multiple objects)? I want to provide a file in the following input format and, after applying the transformation, save the result in some folder. Example: Input Record Format { "name": "adam", "age": 12, "city": "australia" } { "name": "adam", "age": 12, "city": "australia" } { "name": "adam", "age": 12, "city": "australia" } { "name": "adam", "age": 12, "city": "australia" } { "name": "adam",
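Since the input above is a run of back-to-back JSON objects rather than a single JSON document, one possible preprocessing step is to collect the objects into an array before handing them to a transformation. The sketch below uses Python's standard `json.JSONDecoder.raw_decode`; it is an assumption about how the input could be prepared, not part of Jolt itself, and `parse_concatenated_json` is a hypothetical helper name.

```python
import json


def parse_concatenated_json(text):
    """Parse back-to-back JSON values (no enclosing array) into a list."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so step over it here.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
    return records


sample = (
    '{ "name": "adam", "age": 12, "city": "australia" } '
    '{ "name": "adam", "age": 12, "city": "australia" }'
)
records = parse_concatenated_json(sample)

# A single JSON array, which tools expecting one document can consume.
as_array = json.dumps(records)
```

Wrapping the records in an array keeps each object unchanged, so a per-record transformation can still be applied to every element.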

How to loop through excel file and get sheetname using ssis 2008

陌路散爱 submitted on 2020-05-15 08:55:06
Question: I'm trying to load data from an Excel file whose sheet name is not static (the sheet name contains yyyymmdd, which changes with each file) into a SQL database table. I followed the solution provided in "How to loop through Excel files and load them into a database using SSIS package?" but could only get the first For loop working. When I try to assign the user variable 'Sheetname' to the Excel Source under the Data Flow task, I get the error - Error at CSSN_Invoice

Python: Converting excel file to JSON format

夙愿已清 submitted on 2020-05-13 04:58:12
Question: I am creating an ML model that will use a JSON file to understand the pattern and response format. As I have my data in Excel format, I converted it to JSON in Python. Here is the code:

    import xlrd
    from collections import OrderedDict
    import simplejson as json

    # Open the workbook and select the first worksheet
    wb = xlrd.open_workbook('D:\\android\\testdata2.xlsx')
    sh = wb.sheet_by_index(0)

    # List to hold dictionaries
    data_list = []

    # Iterate through each row in worksheet and fetch values into
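The excerpt above is cut off before the row loop, so the following is only a sketch of the general pattern it appears to be building (one ordered dictionary per row, serialized as a JSON list), not the author's code. It assumes the rows have already been read from the worksheet; the column names and sample rows are hypothetical, and the standard-library `json` module stands in for `simplejson`.

```python
import json
from collections import OrderedDict


def rows_to_json(header, rows):
    """Turn worksheet rows into a JSON array of objects keyed by `header`."""
    data_list = []
    for row in rows:
        # Pair each cell with its column name, preserving column order.
        record = OrderedDict(zip(header, row))
        data_list.append(record)
    return json.dumps(data_list)


# Hypothetical column names and cell values, for illustration only.
header = ["id", "question", "answer"]
rows = [[1, "hi", "hello"], [2, "bye", "goodbye"]]
result = rows_to_json(header, rows)
```

With a real workbook, `header` and `rows` would come from the sheet's first row and remaining rows respectively.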