aws-glue

How to create AWS Athena table via Glue crawler when the s3 data store has both json and .gz compressed files?

Submitted by 自作多情 on 2020-06-09 03:59:25
Question: I have two problems in my intended solution. 1. My S3 store structure is as follows:

mainfolder/date=2019-01-01/hour=14/abcd.json
mainfolder/date=2019-01-01/hour=13/abcd2.json.gz
...
mainfolder/date=2019-01-15/hour=13/abcd74.json.gz

All the JSON files have the same schema, and I want to make a crawler pointing to mainfolder/ which can then create a table in Athena for querying. I have already tried with just one file format, e.g. if the files are just json or just gz then the crawler works
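
A possible approach, sketched rather than confirmed: both Glue crawlers and Athena detect gzip compression per object from the .gz extension, so .json and .json.gz files with the same schema can sit under one prefix, and crawling mainfolder/ should pick up date= and hour= as partition keys of a single table. The crawler below is created with boto3; the names, role ARN and bucket are placeholders, not taken from the question.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="mainfolder-json-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="mydb",                                    # placeholder database
    # Point at the top-level prefix so date=/hour= become partitions of one table
    Targets={"S3Targets": [{"Path": "s3://my-bucket/mainfolder/"}]},
)
glue.start_crawler(Name="mainfolder-json-crawler")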

How to load local resource from a python package loaded in AWS PySpark

Submitted by 余生颓废 on 2020-05-28 11:59:10
Question: I have uploaded a Python package into AWS EMR with PySpark. My package has a structure like the following, where I have a resource file (a scikit-learn joblib model) within the package:

myetllib
├── Dockerfile
├── __init__.py
├── modules
│   ├── bin
│   ├── joblib
│   ├── joblib-0.14.1.dist-info
│   ├── numpy
│   ├── numpy-1.18.4.dist-info
│   ├── numpy.libs
│   ├── scikit_learn-0.21.3.dist-info
│   ├── scipy
│   ├── scipy-1.4.1.dist-info
│   └── sklearn
├── requirements.txt
└── mysubmodule
    ├── __init__.py
    ├──
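
One common pattern, as a sketch under assumptions (the resource filename model.joblib and the loader function below are hypothetical, since the tree above is truncated): read the bundled file through pkgutil rather than a filesystem path, so the same code works whether the package is installed as a directory or shipped to Spark executors as a zip via --py-files.

import io
import pkgutil

import joblib


def load_model():
    # pkgutil.get_data reads the resource bytes from the package, including
    # when the package is imported from a zip archive on the executors.
    raw = pkgutil.get_data("myetllib", "model.joblib")  # resource name assumed
    return joblib.load(io.BytesIO(raw))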

Determine AWS region inside an AWS Glue job

Submitted by 久未见 on 2020-05-15 09:52:11
Question: Hello, I need some help determining the AWS region inside a Glue job. I am trying to use the boto3 KMS client, and when I do the following I get NoRegionError: You must specify a region.

kms = boto3.client('kms')

Obviously it is asking me to set region_name when creating the client, but I do not wish to hardcode the region. When running a Glue job I do see a line in the logs which says "Detected region us-east-2", but I am not sure how I can fetch that value.

Answer 1: If you're running Pyspark /
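
A minimal sketch of resolving the region at runtime instead of hardcoding it: in a Glue job the region is normally available from the boto3 session or the AWS_REGION / AWS_DEFAULT_REGION environment variables (the fallback order below is an assumption, not documented Glue behaviour).

import os

import boto3

region = (
    boto3.session.Session().region_name
    or os.environ.get("AWS_REGION")
    or os.environ.get("AWS_DEFAULT_REGION")
)
kms = boto3.client("kms", region_name=region)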

How do I handle errors in mapped functions in AWS Glue?

Submitted by 此生再无相见时 on 2020-05-15 08:47:07
Question: I'm using the map method of DynamicFrame (or, equivalently, the Map.apply method). I've noticed that any errors in the function that I pass to these methods are silently ignored and cause the returned DynamicFrame to be empty. Say I have a job script like this:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *

glueContext = GlueContext(SparkContext.getOrCreate())
dyF = glueContext.create_dynamic_frame.from_catalog
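
One workaround, sketched as a continuation of the (truncated) script above rather than an official Glue error-handling API: catch exceptions inside the mapped function, tag failed records, and split them out with Filter so failures become visible instead of records being silently dropped. The price field is an assumed example; dyF and Filter come from the script and awsglue.transforms.

def safe_transform(rec):
    try:
        rec["price"] = float(rec["price"])  # example transformation; field name assumed
        rec["_error"] = None
    except Exception as e:
        # Record the error instead of letting the record vanish
        rec["_error"] = str(e)
    return rec

mapped = dyF.map(safe_transform)
ok = Filter.apply(frame=mapped, f=lambda r: r["_error"] is None)
failed = Filter.apply(frame=mapped, f=lambda r: r["_error"] is not None)
print("failed records:", failed.count())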

AWS Glue: Crawler does not recognize Timestamp columns in CSV format

Submitted by 六眼飞鱼酱① on 2020-05-15 07:40:35
Question: When running the AWS Glue crawler, it does not recognize timestamp columns. I have correctly formatted ISO8601 timestamps in my CSV file. First I expected Glue to automatically classify these as timestamps, which it does not. I also tried a custom timestamp classifier from this link: https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html Here is what my classifier looks like. This also does not correctly classify my timestamps. I have put them into the Grok debugger (https://grokdebug
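
For context, the built-in CSV classifier only infers a small set of primitive types, so ISO8601 values usually come out as strings. One low-effort workaround (a sketch with placeholder database, table, column and bucket names, not the only possible fix) is to leave the column as a string in the catalog and parse it at query time with Athena's from_iso8601_timestamp, here launched via boto3.

import boto3

athena = boto3.client("athena")

query = """
SELECT from_iso8601_timestamp(event_time) AS event_ts, *
FROM mydb.mytable
LIMIT 10
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)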

Automate bulk loading of data from s3 to Aurora MySQL RDS instance

Submitted by 青春壹個敷衍的年華 on 2020-04-18 12:45:05
Question: I am relatively new to AWS, so I am not sure how to go about doing this. I have CSV files on S3 and I have already set up the Aurora instance on RDS. The thing I am unable to figure out is how to automate the bulk loading of data, essentially doing something like a LOAD DATA FROM S3 using a service such as AWS Glue. I also tried the native Glue path from S3 to RDS, but that is essentially a bunch of inserts into RDS over a JDBC connection, which is also very slow for large datasets. I
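
Aurora MySQL can ingest directly from S3 with LOAD DATA FROM S3 once the cluster has an IAM role attached that allows reading the bucket. One way to automate it, sketched below with placeholder host, credentials and table (pymysql is just one possible MySQL driver, and credentials would normally come from Secrets Manager): a Lambda triggered by S3 put events that issues the statement for each new object.

import pymysql


def handler(event, context):
    conn = pymysql.connect(
        host="my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",  # placeholder
        user="admin",
        password="***",  # placeholder; fetch from Secrets Manager in practice
        database="mydb",
    )
    try:
        with conn.cursor() as cur:
            for record in event["Records"]:
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                # Aurora MySQL extension: bulk-load the object server-side
                cur.execute(
                    f"""
                    LOAD DATA FROM S3 's3://{bucket}/{key}'
                    INTO TABLE my_table
                    FIELDS TERMINATED BY ','
                    LINES TERMINATED BY '\\n'
                    IGNORE 1 LINES
                    """
                )
        conn.commit()
    finally:
        conn.close()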

ssh into glue dev-endpoint as hadoop user `File '/var/aws/emr/userData.json' cannot be read`

Submitted by 房东的猫 on 2020-04-17 22:52:11
Question: Basically, I am trying to solve this problem after setting up my PyCharm connection to the Glue ETL dev endpoint following this tutorial.

java.io.IOException: File '/var/aws/emr/userData.json' cannot be read

The above file is owned by hadoop.

[glue@ip-xx.xx.xx.xx ~]$ ls -la /var/aws/emr/
total 32
drwxr-xr-x 4 root root 4096 Mar 24 19:35 .
drwxr-xr-x 3 root root 4096 Feb 12  2019 ..
drwxr-xr-x 3 root root 4096 Feb 12  2019 bigtop-deploy
drwxr-xr-x 3 root root 4096 Mar 24 19:35 packages
-rw-r--r-- 1 root
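
A small diagnostic sketch (not a fix taken from the thread): run it on the dev endpoint as the user you SSH in with to see who actually owns the file, what its mode is, and whether that user can read it, which helps narrow the error down to permissions versus a missing file.

import os
import pwd
import stat

path = "/var/aws/emr/userData.json"

try:
    st = os.stat(path)
except FileNotFoundError:
    print(f"{path} does not exist on this host")
else:
    owner = pwd.getpwuid(st.st_uid).pw_name
    print(f"owner={owner} mode={oct(stat.S_IMODE(st.st_mode))}")
    print("readable by current user:", os.access(path, os.R_OK))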

AWS Glue automatic job creation

Submitted by 試著忘記壹切 on 2020-03-03 10:12:10
Question: I have a PySpark script which I can run in AWS Glue, but every time I have to create the job from the UI and copy my code into it. Is there any way I can automatically create the job from my file in an S3 bucket? (I have all the libraries and the Glue context that will be used while running.)

Answer 1: Another alternative is to use AWS CloudFormation. You can define all the AWS resources you want to create (not only Glue jobs) in a template file and then update the stack whenever you need, from the AWS Console or using the CLI.
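
Besides CloudFormation, the job can also be created directly through the Glue API with boto3, pointing it at the script that already sits in S3. A sketch, with the role ARN, bucket paths and names as placeholders rather than values from the question:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="my-etl-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/my_etl.py",  # script already in S3
        "PythonVersion": "3",
    },
    # Extra libraries can be attached the same way the console does it
    DefaultArguments={"--extra-py-files": "s3://my-bucket/libs/mylib.zip"},
    GlueVersion="2.0",
    NumberOfWorkers=2,
    WorkerType="G.1X",
)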