aws-glue

Specify a SerDe serialization lib with AWS Glue Crawler

守給你的承諾、 submitted on 2021-01-02 20:09:22
Question: Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't classify the data correctly (e.g. quoted fields that contain commas). I then have to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde. I've tried making my own CSV classifier, but that doesn't help. How do I get the crawler to specify a particular serialization lib for the tables it produces or updates?

Answer 1: You can't
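A minimal sketch of one possible workaround, not part of the original answer: since the crawler itself offers no SerDe setting, the catalog table can be patched with boto3 after each crawl. The database, table, and delimiter values below are placeholder assumptions.

import boto3

glue = boto3.client("glue")

def set_opencsv_serde(database, table_name):
    # Fetch the catalog entry the crawler produced.
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]

    # get_table returns read-only fields that update_table rejects,
    # so keep only the keys that are valid in TableInput.
    allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
               "PartitionKeys", "TableType", "Parameters"}
    table_input = {k: v for k, v in table.items() if k in allowed}

    # Overwrite the SerDe the crawler chose (LazySimpleSerDe) with OpenCSVSerde.
    table_input["StorageDescriptor"]["SerdeInfo"] = {
        "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
        "Parameters": {"separatorChar": ",", "quoteChar": '"'},
    }
    glue.update_table(DatabaseName=database, TableInput=table_input)

set_opencsv_serde("my_database", "my_csv_table")  # placeholder names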

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

谁都会走 submitted on 2020-12-31 20:17:46
Question: Problem statement/root cause: We are using AWS Glue to load data from a production Postgres DB into an AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is failing, however, because Spark only supports lowercase table column names, and unfortunately all of our source Postgres column names are in CamelCase and enclosed in double quotes. For example, our source column name in the Postgres DB is "CreatedDate". The Spark job query is looking for createddate and is
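One possible workaround, sketched here as an assumption rather than the answer from the thread: push a query down to Postgres that quotes the CamelCase identifiers and aliases them to lowercase, so Spark never has to resolve "CreatedDate" itself. The JDBC URL, credentials, and column names are placeholders, and the "query" option requires Spark 2.4+ (Glue 2.0 or later).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("postgres-camelcase-columns").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/source_db")  # placeholder
    .option("user", "etl_user")
    .option("password", "***")
    .option("driver", "org.postgresql.Driver")
    # Quote the CamelCase identifiers in the pushed-down SQL and alias them
    # to the lowercase names the rest of the Glue/Spark job expects.
    .option("query", 'SELECT "Id" AS id, "CreatedDate" AS createddate FROM "MyTable"')
    .load()
)

df.printSchema()  # columns arrive as id, createddate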

aws athena - Create table by an array of json object

瘦欲@ submitted on 2020-12-25 04:50:13
Question: Can I get help creating a table on AWS Athena? For a sample of data such as [{"lts": 150}], AWS Glue generates the schema as array (array<struct<lts:int>>). When I try to preview the table created by AWS Glue, I get this error: HIVE_BAD_DATA: Error parsing field value for field 0: org.openx.data.jsonserde.json.JSONObject cannot be cast to org.openx.data.jsonserde.json.JSONArray. The error message is clear, but I can't find the source of the problem!

Answer 1: Hive running under
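A sketch of one possible pre-processing step, offered as an assumption rather than the thread's answer: the OpenX JSON SerDe reads one JSON value per line, so rewriting each file from a top-level array into newline-delimited objects sidesteps the JSONObject/JSONArray cast. Bucket and key names are placeholders.

import json
import boto3

s3 = boto3.client("s3")

def array_file_to_json_lines(bucket, src_key, dst_key):
    # Load the whole file, which is a top-level JSON array, e.g. [{"lts": 150}].
    body = s3.get_object(Bucket=bucket, Key=src_key)["Body"].read()
    records = json.loads(body)

    # Re-emit one JSON object per line so each line is a record for the SerDe.
    lines = "\n".join(json.dumps(record) for record in records)
    s3.put_object(Bucket=bucket, Key=dst_key, Body=lines.encode("utf-8"))

array_file_to_json_lines("my-bucket", "raw/sample.json", "jsonl/sample.json")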

How to access data in subdirectories for partitioned Athena table

懵懂的女人 submitted on 2020-12-12 18:50:12
Question: I have an Athena table with a partition for each day, where the actual files are in "sub-directories" by hour, as follows:

s3://my-bucket/data/2019/06/27/00/00001.json
s3://my-bucket/data/2019/06/27/00/00002.json
s3://my-bucket/data/2019/06/27/01/00001.json
s3://my-bucket/data/2019/06/27/01/00002.json

Athena is able to query this table without issue and find my data, but AWS Glue does not appear to be able to find this data. ALTER TABLE mytable ADD PARTITION (year=2019, month
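The ALTER TABLE statement in the question is cut off above. As a sketch of one way a Glue ETL job can pick up the hourly sub-directories (an assumption, not the thread's answer), the S3 reader's "recurse" connection option walks below the day-level prefix; the path is a placeholder.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/data/2019/06/27/"],  # day-level prefix
        "recurse": True,  # descend into the 00/, 01/, ... hour sub-directories
    },
    format="json",
)

print(dyf.count())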

AWS Glue job consuming data from external REST API

大城市里の小女人 submitted on 2020-12-06 07:30:11
Question: I'm trying to create a workflow where an AWS Glue ETL job will pull JSON data from an external REST API instead of S3 or any other AWS-internal source. Is that even possible? Has anyone done it? Please help!

Answer 1: Yes, I extract data from REST APIs like Twitter, FullStory, Elasticsearch, etc. Usually I use Python Shell jobs for the extraction because they are faster (relatively small cold start). When the extraction finishes, it triggers a Spark-type job that reads only the JSON items I need. I use
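A minimal sketch of the pattern the answer describes, with the endpoint, token, and bucket as placeholder assumptions: a Python Shell job calls the REST API with requests and lands newline-delimited JSON in S3 for the downstream Spark job to read.

import json
from datetime import datetime, timezone

import boto3
import requests

API_URL = "https://api.example.com/v1/events"  # hypothetical endpoint
BUCKET = "my-datalake-raw"                     # hypothetical bucket

def extract_to_s3():
    # Pull the JSON payload from the external REST API.
    response = requests.get(API_URL, headers={"Authorization": "Bearer <token>"}, timeout=60)
    response.raise_for_status()
    records = response.json()

    # Write newline-delimited JSON so the downstream Spark/Glue job can read it directly.
    body = "\n".join(json.dumps(record) for record in records)
    key = f"events/dt={datetime.now(timezone.utc):%Y-%m-%d}/extract.json"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return key

if __name__ == "__main__":
    print(extract_to_s3())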
