aws-glue

Specify a SerDe serialization lib with AWS Glue Crawler

守給你的承諾、 submitted on 2021-01-02 20:09:22
Question: Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't classify the data correctly (e.g. quoted fields that contain commas). I then have to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde. I've tried making my own CSV classifier, but that doesn't help. How do I get the crawler to specify a particular serialization lib for the tables it produces or updates?

Answer 1: You can't
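A minimal sketch of one possible workaround, not part of the original answer: since the crawler itself offers no SerDe setting, the catalog table can be patched with boto3 after each crawl. The database, table, and delimiter values below are placeholder assumptions.

import boto3

glue = boto3.client("glue")

def set_opencsv_serde(database, table_name):
    # Fetch the catalog entry the crawler produced.
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]

    # get_table returns read-only fields that update_table rejects,
    # so keep only the keys that are valid in TableInput.
    allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
               "PartitionKeys", "TableType", "Parameters"}
    table_input = {k: v for k, v in table.items() if k in allowed}

    # Overwrite the SerDe the crawler chose (LazySimpleSerDe) with OpenCSVSerde.
    table_input["StorageDescriptor"]["SerdeInfo"] = {
        "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
        "Parameters": {"separatorChar": ",", "quoteChar": '"'},
    }
    glue.update_table(DatabaseName=database, TableInput=table_input)

set_opencsv_serde("my_database", "my_csv_table")  # placeholder names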

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

谁都会走 submitted on 2020-12-31 20:17:46
Question: Problem statement/root cause: We are using AWS Glue to load data from a production Postgres DB into an AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is failing, however, because Spark only supports lowercase table column names, and unfortunately all of our source Postgres column names are in CamelCase and enclosed in double quotes. For example, our source column name in the Postgres DB is "CreatedDate". The Spark job query is looking for createddate and is
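One possible workaround, sketched here as an assumption rather than the answer from the thread: push a query down to Postgres that quotes the CamelCase identifiers and aliases them to lowercase, so Spark never has to resolve "CreatedDate" itself. The JDBC URL, credentials, and column names are placeholders, and the "query" option requires Spark 2.4+ (Glue 2.0 or later).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("postgres-camelcase-columns").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/source_db")  # placeholder
    .option("user", "etl_user")
    .option("password", "***")
    .option("driver", "org.postgresql.Driver")
    # Quote the CamelCase identifiers in the pushed-down SQL and alias them
    # to the lowercase names the rest of the Glue/Spark job expects.
    .option("query", 'SELECT "Id" AS id, "CreatedDate" AS createddate FROM "MyTable"')
    .load()
)

df.printSchema()  # columns arrive as id, createddate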

aws athena - Create table by an array of json object

瘦欲@ submitted on 2020-12-25 04:50:13
Question: Can I get help creating a table on AWS Athena? For a sample of data such as [{"lts": 150}], AWS Glue generates the schema as array (array<struct<lts:int>>). When I try to preview the table created by AWS Glue, I get this error: HIVE_BAD_DATA: Error parsing field value for field 0: org.openx.data.jsonserde.json.JSONObject cannot be cast to org.openx.data.jsonserde.json.JSONArray. The error message is clear, but I can't find the source of the problem!

Answer 1: Hive running under
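A sketch of one possible pre-processing step, offered as an assumption rather than the thread's answer: the OpenX JSON SerDe reads one JSON value per line, so rewriting each file from a top-level array into newline-delimited objects sidesteps the JSONObject/JSONArray cast. Bucket and key names are placeholders.

import json
import boto3

s3 = boto3.client("s3")

def array_file_to_json_lines(bucket, src_key, dst_key):
    # Load the whole file, which is a top-level JSON array, e.g. [{"lts": 150}].
    body = s3.get_object(Bucket=bucket, Key=src_key)["Body"].read()
    records = json.loads(body)

    # Re-emit one JSON object per line so each line is a record for the SerDe.
    lines = "\n".join(json.dumps(record) for record in records)
    s3.put_object(Bucket=bucket, Key=dst_key, Body=lines.encode("utf-8"))

array_file_to_json_lines("my-bucket", "raw/sample.json", "jsonl/sample.json")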

How to access data in subdirectories for partitioned Athena table

懵懂的女人 submitted on 2020-12-12 18:50:12
Question: I have an Athena table with a partition for each day, where the actual files are in "sub-directories" by hour, as follows:

s3://my-bucket/data/2019/06/27/00/00001.json
s3://my-bucket/data/2019/06/27/00/00002.json
s3://my-bucket/data/2019/06/27/01/00001.json
s3://my-bucket/data/2019/06/27/01/00002.json

Athena is able to query this table without issue and find my data, but AWS Glue does not appear to be able to find this data. ALTER TABLE mytable ADD PARTITION (year=2019, month
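The ALTER TABLE statement in the question is cut off above. As a sketch of one way a Glue ETL job can pick up the hourly sub-directories (an assumption, not the thread's answer), the S3 reader's "recurse" connection option walks below the day-level prefix; the path is a placeholder.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/data/2019/06/27/"],  # day-level prefix
        "recurse": True,  # descend into the 00/, 01/, ... hour sub-directories
    },
    format="json",
)

print(dyf.count())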

AWS Glue job consuming data from external REST API

大城市里の小女人 submitted on 2020-12-06 07:30:11
Question: I'm trying to create a workflow where an AWS Glue ETL job will pull JSON data from an external REST API instead of S3 or any other AWS-internal source. Is that even possible? Has anyone done it? Please help!

Answer 1: Yes, I extract data from REST APIs like Twitter, FullStory, Elasticsearch, etc. Usually I use Python Shell jobs for the extraction because they are faster (relatively small cold start). When the extraction finishes, it triggers a Spark-type job that reads only the JSON items I need. I use
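A minimal sketch of the pattern the answer describes, with the endpoint, token, and bucket as placeholder assumptions: a Python Shell job calls the REST API with requests and lands newline-delimited JSON in S3 for the downstream Spark job to read.

import json
from datetime import datetime, timezone

import boto3
import requests

API_URL = "https://api.example.com/v1/events"  # hypothetical endpoint
BUCKET = "my-datalake-raw"                     # hypothetical bucket

def extract_to_s3():
    # Pull the JSON payload from the external REST API.
    response = requests.get(API_URL, headers={"Authorization": "Bearer <token>"}, timeout=60)
    response.raise_for_status()
    records = response.json()

    # Write newline-delimited JSON so the downstream Spark/Glue job can read it directly.
    body = "\n".join(json.dumps(record) for record in records)
    key = f"events/dt={datetime.now(timezone.utc):%Y-%m-%d}/extract.json"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return key

if __name__ == "__main__":
    print(extract_to_s3())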
