aws-glue

Partitioning Athena Tables from Glue CloudFormation template

Submitted by 风格不统一 on 2021-02-19 06:28:26
Question: Using AWS::Glue::Table, you can set up an Athena table as shown here. Athena supports partitioning data based on the folder structure in S3, and I would like to partition my Athena table from my Glue template. From the AWS Glue TableInput documentation, it appears that I can use PartitionKeys to partition my data, but when I try the template below, Athena fails and can't get any data.

    Resources:
      ...
      MyGlueTable:
        Type: AWS::Glue::Table
        Properties:
          DatabaseName: !Ref MyGlueDatabase
          CatalogId: !Ref AWS:
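For reference, a minimal sketch of a partitioned table declaration; the table name, S3 path, formats, and partition column below are assumptions for illustration, not taken from the truncated post. Note that declaring PartitionKeys only defines the partition columns; the partitions themselves still have to be registered (for example by a crawler, or MSCK REPAIR TABLE in Athena) before queries return rows.

    MyGlueTable:
      Type: AWS::Glue::Table
      Properties:
        CatalogId: !Ref AWS::AccountId
        DatabaseName: !Ref MyGlueDatabase
        TableInput:
          Name: my_table                 # hypothetical table name
          TableType: EXTERNAL_TABLE
          PartitionKeys:                 # partition columns go here, not in
            - Name: dt                   # StorageDescriptor.Columns
              Type: string
          StorageDescriptor:
            Location: s3://my-bucket/data/   # hypothetical location
            InputFormat: org.apache.hadoop.mapred.TextInputFormat
            OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            SerdeInfo:
              SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Columns:
              - Name: id
                Type: string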

Wait until AWS Glue crawler has finished running

Submitted by ℡╲_俬逩灬. on 2021-02-19 05:30:38
Question: In the documentation, I cannot find any way of checking the run status of a crawler. The only way I am doing it currently is repeatedly checking AWS to see whether the file/table has been created. Is there a better way to block until the crawler finishes its run?

Answer 1: You can use boto3 (or similar) to do it. There is the get_crawler method; you will find the needed information in the "LastCrawl" section: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_crawler
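A minimal polling sketch built on that answer; the crawler name and poll interval are placeholders:

    import time
    import boto3

    glue = boto3.client("glue")

    def wait_for_crawler(name, poll_seconds=30):
        """Block until the named crawler returns to the READY state."""
        while True:
            crawler = glue.get_crawler(Name=name)["Crawler"]
            if crawler["State"] == "READY":  # i.e. not RUNNING or STOPPING
                # LastCrawl describes the run that just completed
                return crawler.get("LastCrawl", {}).get("Status")
            time.sleep(poll_seconds)

    # Hypothetical usage:
    # glue.start_crawler(Name="my-crawler")
    # print(wait_for_crawler("my-crawler"))  # e.g. SUCCEEDED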

Athena puts data in incorrect columns when input data format changes

Submitted by 这一生的挚爱 on 2021-02-17 05:38:30
Question: We have some pipe-delimited .txt reports coming into a folder in S3, on which we run a Glue crawler to determine the schema and then query in Athena. The format of the report changed recently, so there are two new columns in the middle.

Old files:

    Columns: A  B  C  D  E  F
    Data:    a1 b1 c1 d1 e1 f1

New files with extra "G" and "H" columns:

    Columns: A  B  G  H  C  D  E  F
    Data:    a2 b2 g2 h2 c2 d2 e2 f2

What we get in the table created by the crawler, as seen in Athena:

    Columns: A  B  C  D  E  F  G  H  <- Puts new columns at the
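The post is cut off before any answer, but the symptom follows from the fact that text-format Athena tables map columns by position, not by header name. One way to reconcile the two layouts in a Glue/PySpark job (a sketch under assumed paths and separators, not the poster's accepted fix) is to read each file generation with its own explicit column order and union by name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    old_cols = ["A", "B", "C", "D", "E", "F"]
    new_cols = ["A", "B", "G", "H", "C", "D", "E", "F"]

    # Read each generation with the column order it actually has on disk
    old_df = spark.read.csv("s3://bucket/reports/old/", sep="|").toDF(*old_cols)
    new_df = spark.read.csv("s3://bucket/reports/new/", sep="|").toDF(*new_cols)

    # Align by column name; old rows get nulls for the missing G/H columns
    # (allowMissingColumns requires Spark 3.1+, i.e. Glue 3.0 or later)
    combined = old_df.unionByName(new_df, allowMissingColumns=True)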

Spark - Read and Write back to same S3 location

Submitted by a 夏天 on 2021-02-17 02:48:10
Question: I am reading datasets dataset1 and dataset2 from S3 locations. I then transform them and write back to the same location that dataset2 was read from. However, I get the error message below:

    An error occurred while calling o118.save.
    No such file or directory 's3://<myPrefix>/part-00001-a123a120-7d11-581a-b9df-bc53076d57894-c000.snappy.parquet'

If I try to write to a new S3 location, e.g. s3://dataset_new_path.../, then the code works fine.

    my_df \
      .write.mode('overwrite') \
      .format('parquet') \
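The question is truncated before any answer. The usual cause is that overwrite mode deletes the input files while Spark's lazy plan is still trying to read them. A common workaround, stated here as an assumption rather than the thread's accepted answer, is to stage the output at a different prefix, or to break the lineage before overwriting; a minimal sketch reusing the post's placeholder paths:

    # Option 1: write to a staging prefix, then swap/copy afterwards
    my_df.write.mode("overwrite").format("parquet").save("s3://<myPrefix>-staging/")

    # Option 2: materialize the frame so the overwrite no longer re-reads
    # the source (cached data can still be evicted, so Option 1 is safer)
    my_df = my_df.cache()
    my_df.count()
    my_df.write.mode("overwrite").format("parquet").save("s3://<myPrefix>/")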

How to UnPivot COLUMNS into ROWS in AWS Glue / Py Spark script

Submitted by 心已入冬 on 2021-02-11 15:05:00
Question: I have a large nested JSON document for each year (say 2018, 2017), which has data aggregated by each month (Jan-Dec) and each day (1-31).

    {
      "2018": {
        "Jan": {
          "1": { "u": 1, "n": 2 },
          "2": { "u": 4, "n": 7 }
        },
        "Feb": {
          "1": { "u": 3, "n": 2 },
          "4": { "u": 4, "n": 5 }
        }
      }
    }

I have used the AWS Glue Relationalize.apply function to convert the above hierarchical data into a flat structure:

    dfc = Relationalize.apply(frame = datasource0, staging_path = my_temp_bucket, name = my_ref_relationalize_table,
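The post breaks off before the unpivot itself. For reference, the usual way to unpivot columns into rows in plain PySpark is a stack expression; the frame and column names below are assumptions for illustration, not the poster's actual schema:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical flattened frame: one row per day, one column per metric
    df = spark.createDataFrame(
        [("2018", "Jan", "1", 1, 2), ("2018", "Jan", "2", 4, 7)],
        ["year", "month", "day", "u", "n"],
    )

    # stack(n, label1, col1, ..., labeln, coln) emits one row per pair
    unpivoted = df.selectExpr(
        "year", "month", "day",
        "stack(2, 'u', u, 'n', n) as (metric, value)",
    )
    unpivoted.show()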

AWS Glue: Rename_field() does not work after relationalize

Submitted by 微笑、不失礼 on 2021-02-10 15:00:58
Question: I have a job that needs to perform the following tasks:

1. Relationalize the data.
2. Rename the field names that contain '.'s, so the data can be imported into PostgreSQL with normal-looking field names.

Here is the code:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc =
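The code is cut off before the failing rename call. One detail that commonly trips this up, offered here as background rather than as the thread's answer: rename_field interprets '.' as a nesting separator, so a relationalized field whose name literally contains a dot has to be quoted with backticks. A minimal sketch with placeholder names:

    # dfc is the DynamicFrameCollection returned by Relationalize.apply;
    # the key name here is an assumption
    dyf = dfc.select("roottable")

    # Backticks make rename_field treat "a.b" as one literal field name,
    # instead of field "b" nested inside struct "a"
    dyf = dyf.rename_field("`a.b`", "a_b")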

How can I use an external python library in AWS Glue?

Submitted by 烂漫一生 on 2021-02-07 04:01:29
Question: First Stack Overflow question here, hope I do this correctly: I need to use an external Python library, openpyxl, in AWS Glue. I followed these directions: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html However, after saving my zip file in the correct S3 location and pointing my Glue job to that location, I'm not sure what to actually write in the script. I tried the typical import openpyxl, but that just returns the
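The post is truncated before the error text. For context, a hedged sketch of the usual setup: the zip uploaded to S3 must contain the openpyxl package directory at its top level (plus openpyxl's own dependency, et_xmlfile, a common gotcha), the job's Python library path must point at that zip, and the script then imports it like any installed package. The smoke test below is illustrative only:

    # Glue job script: once the zip is on the job's Python path,
    # a plain lowercase import is all that is needed
    import openpyxl

    wb = openpyxl.Workbook()      # build a trivial workbook as a smoke test
    ws = wb.active
    ws["A1"] = "hello from glue"
    wb.save("/tmp/test.xlsx")     # Glue workers have a writable /tmp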
