data-lake

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

谁都会走 submitted on 2020-12-31 20:17:46
Question: Problem Statement/Root Cause: We are using AWS Glue to load data from a production PostgreSQL database into an AWS data lake. Glue internally uses a Spark job to move the data. However, our ETL process is failing because Spark only supports lowercase table column names and, unfortunately, all our source PostgreSQL table column names are in CamelCase and enclosed in double quotes. For example, our source table column name in the PostgreSQL database is "CreatedDate". The Spark job query is looking for createddate and is
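The excerpt above is cut off, but the mismatch it describes follows from how PostgreSQL treats identifiers: unquoted names are folded to lowercase, while double-quoted names keep their case. The following is a minimal sketch of one possible workaround, not the asker's or AWS's method: build a SELECT statement that quotes each source column and aliases it to a lowercase name. The helper names `quote_ident` and `build_select` and the table name `Orders` are hypothetical; only the column `CreatedDate` comes from the question.

```python
from typing import Sequence


def quote_ident(name: str) -> str:
    """Wrap a PostgreSQL identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(table: str, columns: Sequence[str]) -> str:
    """Build a SELECT that reads case-sensitive columns and aliases them to lowercase."""
    select_list = ", ".join(
        f"{quote_ident(col)} AS {quote_ident(col.lower())}" for col in columns
    )
    return f"SELECT {select_list} FROM {quote_ident(table)}"


# "Orders" is a hypothetical table name used only for illustration.
query = build_select("Orders", ["CreatedDate"])
print(query)
```

A string like this could then be passed to whatever JDBC read path the job uses, so that downstream code only sees lowercase column names. Whether that fits a given Glue job is an assumption that would need to be checked against the job's connection options.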

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

拥有回忆 submitted on 2020-12-31 20:07:44
Question: Problem Statement/Root Cause: We are using AWS Glue to load data from a production PostgreSQL database into an AWS data lake. Glue internally uses a Spark job to move the data. However, our ETL process is failing because Spark only supports lowercase table column names and, unfortunately, all our source PostgreSQL table column names are in CamelCase and enclosed in double quotes. For example, our source table column name in the PostgreSQL database is "CreatedDate". The Spark job query is looking for createddate and is

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

扶醉桌前 submitted on 2020-12-31 20:01:17
Question: Problem Statement/Root Cause: We are using AWS Glue to load data from a production PostgreSQL database into an AWS data lake. Glue internally uses a Spark job to move the data. However, our ETL process is failing because Spark only supports lowercase table column names and, unfortunately, all our source PostgreSQL table column names are in CamelCase and enclosed in double quotes. For example, our source table column name in the PostgreSQL database is "CreatedDate". The Spark job query is looking for createddate and is

Data Governance solution for Databricks, Synapse and ADLS gen2

送分小仙女□ submitted on 2020-06-10 06:45:31
Question: I'm new to data governance, so forgive me if the question lacks some information. Objective: We're building a data lake and enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS gen2, Databricks and Synapse for our ETL processing, data science, ML and QA activities. We already have about a hundred input tables and 25 TB per year. In the future we expect more. The business strongly prefers cloud-agnostic solutions. Still they are okay

Data Governance solution for Databricks, Synapse and ADLS gen2

落花浮王杯 submitted on 2020-06-10 06:43:25
Question: I'm new to data governance, so forgive me if the question lacks some information. Objective: We're building a data lake and enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS gen2, Databricks and Synapse for our ETL processing, data science, ML and QA activities. We already have about a hundred input tables and 25 TB per year. In the future we expect more. The business strongly prefers cloud-agnostic solutions. Still they are okay

Backup of Data Lake Store

与世无争的帅哥 submitted on 2020-01-16 19:46:07
Question: I am working on a backup strategy for Data Lake Store (DLS). My plan is to create two DLS accounts and copy data between them. I have evaluated several approaches to achieve this, but none of them satisfies the requirement to preserve the POSIX ACLs (permissions in DLS parlance). PowerShell cmdlets require data to be downloaded from the primary DLS onto a VM and re-uploaded onto the secondary DLS. The AdlCopy tool works only on Windows 10, does not preserve permissions and neither supports

Is Data Lake and Big Data the same?

扶醉桌前 submitted on 2019-12-31 02:41:47
Question: I am trying to understand whether there is a real difference between a data lake and big data. If you check the concepts, both are like a big repository that stores information until it becomes necessary. So when can we say that we are using big data or a data lake? Thanks in advance. Answer 1: I can't say I've come across the term 'big repository' before, but to answer the original question: no, data lake and big data are not the same, although in fairness they are both thrown around a lot and the

Is Data Lake and Big Data the same?

非 Y 不嫁゛ submitted on 2019-12-02 02:27:14
I am trying to understand whether there is a real difference between a data lake and big data. If you check the concepts, both are like a big repository that stores information until it becomes necessary. So when can we say that we are using big data or a data lake? Thanks in advance. I can't say I've come across the term 'big repository' before, but to answer the original question: no, data lake and big data are not the same, although in fairness they are both thrown around a lot and the definitions vary depending on who you ask. I'll try to give it a shot. Big Data: Is used to describe both the