aws-glue

AWS Glue predicate push down condition has no effect

*爱你&永不变心* posted on 2019-12-07 16:16:50
Question: I have a MySQL source from which I am creating a Glue DynamicFrame with a predicate pushdown condition, as follows:

    datasource = glueContext.create_dynamic_frame_from_catalog(
        database = source_catalog_db,
        table_name = source_catalog_tbl,
        push_down_predicate = "id > 1531812324",
        transformation_ctx = "datasource")

I always get all of the records in 'datasource', whatever condition I put in 'push_down_predicate'. What am I missing?

Answer 1: Pushdown predicates work for partitioning columns only; since this source is a MySQL (JDBC) table rather than a partitioned S3-backed dataset, the predicate has no effect.
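For reference, this is how a pushdown predicate is typically used against a partitioned, S3-backed catalog table: a minimal sketch, assuming a hypothetical table partitioned by year/month (the database and table names are made up):

    # The predicate is evaluated against partition column values, not row data,
    # so it only prunes partitions of an S3-backed table in the Data Catalog.
    partitioned_source = glueContext.create_dynamic_frame_from_catalog(
        database = "my_s3_database",          # assumed catalog database
        table_name = "my_partitioned_table",  # assumed partitioned, S3-backed table
        push_down_predicate = "year == '2019' and month == '12'",
        transformation_ctx = "partitioned_source")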

Use SQL inside an AWS Glue PySpark script

*独自空忆成欢* posted on 2019-12-07 01:50:40
Question: I want to use AWS Glue to convert some CSV data to ORC. The ETL job I created generated the following PySpark script:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
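The question is cut off here, but the usual way to run SQL from a Glue PySpark script is to convert the DynamicFrame to a Spark DataFrame, register it as a temporary view, and query it with spark.sql. A minimal sketch, assuming a catalog database 'my_db' and table 'my_csv_table' (both names are made up):

    from awsglue.dynamicframe import DynamicFrame

    # Load the source table from the Glue Data Catalog (names are assumptions)
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database = "my_db", table_name = "my_csv_table")

    # Convert to a Spark DataFrame so Spark SQL can be used on it
    df = dyf.toDF()
    df.createOrReplaceTempView("my_csv_table")

    # Run SQL, then convert the result back to a DynamicFrame for Glue writers
    result_df = spark.sql("SELECT col_a, COUNT(*) AS cnt FROM my_csv_table GROUP BY col_a")
    result_dyf = DynamicFrame.fromDF(result_df, glueContext, "result_dyf")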

HIVE_UNKNOWN_ERROR when running AWS Athena query on Glue table (RDS)

*余生颓废* posted on 2019-12-06 06:23:51
I'm getting an error when running an Athena query against a Glue table created from an RDS database:

    HIVE_UNKNOWN_ERROR: Unable to create input format

The tables are created using a crawler, and they show up correctly in the Glue interface. However, they do not show up in the Athena interface under the database; it says "The selected database has no tables". I do not see this behaviour when using a database created from an S3 file, so maybe this is related to the error. Does anybody have an idea?

I had the same problem. This is the answer that I got from AWS Support: I understand that you

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

*自古美人都是妖i* posted on 2019-12-06 05:48:30
Question: I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use the AWS Glue Data Catalog as its metastore. I have followed the steps in the official AWS documentation (reference link below), but I am seeing discrepancies when accessing the Glue Catalog databases/tables. Both the EMR cluster and AWS Glue are in the same account, and the appropriate IAM permissions have been granted. AWS Documentation: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark
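For context, the setup the referenced documentation describes comes down to two pieces: the cluster's spark-hive-site classification must point the Hive metastore client at the Glue Catalog (hive.metastore.client.factory.class = com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory), and the SparkSession must be created with Hive support enabled. A minimal sketch of the Spark side, with an assumed database name:

    from pyspark.sql import SparkSession

    # Assumes the EMR cluster was launched with the spark-hive-site classification
    # pointing at the Glue Data Catalog; without it, this only sees the local metastore.
    spark = (SparkSession.builder
             .appName("glue-catalog-check")   # assumed application name
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("show databases").show()
    spark.sql("show tables in my_glue_db").show()   # my_glue_db is hypothetical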

Overwrite parquet files from dynamic frame in AWS Glue

*旧城冷巷雨未停* posted on 2019-12-05 18:54:27
Question: I use dynamic frames to write Parquet files to S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this:

    glueContext.write_dynamic_frame.from_options(
        frame = table,
        connection_type = "s3",
        connection_options = {"path": output_dir, "partitionKeys": ["var1","var2"]},
        format = "parquet")

Is there anything like "mode": "overwrite" that replaces my Parquet files?

Answer 1: Currently AWS Glue doesn't support an 'overwrite' mode, but they are working on it.
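The answer is truncated here. A common workaround is to convert the DynamicFrame to a Spark DataFrame and use the DataFrame writer's overwrite mode instead; a sketch, reusing 'table' and 'output_dir' from the question:

    # Let Spark handle the write so its save modes are available; note that
    # "overwrite" replaces the existing contents of the target path.
    df = table.toDF()
    (df.write
       .mode("overwrite")
       .partitionBy("var1", "var2")
       .parquet(output_dir))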

AWS Glue - can't set spark.yarn.executor.memoryOverhead

*陌路散爱* posted on 2019-12-05 17:49:43
When running a Python job in AWS Glue I get the error:

    Reason: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead

When running this at the beginning of the script:

    print '--- Before Conf --'
    print 'spark.yarn.driver.memory', sc._conf.get('spark.yarn.driver.memory')
    print 'spark.yarn.driver.cores', sc._conf.get('spark.yarn.driver.cores')
    print 'spark.yarn.executor.memory', sc._conf.get('spark.yarn.executor.memory')
    print 'spark.yarn.executor.cores', sc._conf.get('spark.yarn.executor.cores')
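The post is cut off here. For reference, the standard Spark way to set this value is on a SparkConf before the SparkContext is created, although Glue provisions the context itself and may ignore or override such settings, so this is only a sketch of the attempt being discussed:

    from pyspark.conf import SparkConf
    from pyspark.context import SparkContext

    # Request extra off-heap overhead per executor; whether Glue honors this
    # depends on how the job's Spark environment is provisioned.
    conf = SparkConf().set('spark.yarn.executor.memoryOverhead', '1024')
    sc = SparkContext(conf=conf)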

AWS Glue Crawler Not Creating Table

*元气小坏坏* posted on 2019-12-05 13:26:07
Question: I have a crawler I created in AWS Glue that does not create a table in the Data Catalog after it completes successfully. The crawler takes roughly 20 seconds to run and the logs show it completed successfully. The CloudWatch log shows:

    Benchmark: Running Start Crawl for Crawler
    Benchmark: Classification Complete, writing results to DB
    Benchmark: Finished writing to Catalog
    Benchmark: Crawler has finished running and is in ready state

I am at a loss as to why the tables are not being created in the Data Catalog.

How to list all databases and tables in AWS Glue Catalog?

*孤街醉人* posted on 2019-12-05 09:50:22
I created a Development Endpoint in the AWS Glue console and now have access to SparkContext and SQLContext in the gluepyspark console. How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does not work. What might help is the CatalogConnection class, but I have no idea which package it is in; I tried importing it from awsglue.context with no success.

I spent several hours trying to find some information about the CatalogConnection class but haven't found anything (even in the aws-glue-libs repository, https://github.com/awslabs/aws-glue-libs ). In my case
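The answer is cut off here. A straightforward way to enumerate the catalog, independent of Spark, is the boto3 Glue client; a sketch, assuming credentials and a default region are already configured in the environment:

    import boto3

    glue = boto3.client('glue')

    # Page through all databases, then the tables in each one
    db_paginator = glue.get_paginator('get_databases')
    tbl_paginator = glue.get_paginator('get_tables')

    for db_page in db_paginator.paginate():
        for db in db_page['DatabaseList']:
            print(db['Name'])
            for tbl_page in tbl_paginator.paginate(DatabaseName=db['Name']):
                for tbl in tbl_page['TableList']:
                    print('  ' + tbl['Name'])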

AWS Glue: crawler misinterprets timestamps as strings. GLUE ETL meant to convert strings to timestamps makes them NULL

*故事扮演* posted on 2019-12-05 01:47:36
Question: I have been playing around with AWS Glue for some quick analytics by following the tutorial here. While I have been able to successfully create crawlers and discover data in Athena, I've had issues with the data types created by the crawler: the date and timestamp columns get read as string data types. I followed this up by creating an ETL job in Glue using the data source created by the crawler as the input and a target table in Amazon S3. As part of the mapping transformation, I converted
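The question is cut off here. For context, when a plain string-to-timestamp conversion produces NULLs it is usually because the string format does not match what Spark expects, and supplying an explicit pattern often fixes it. A sketch, where 'datasource' stands for the crawler-created source used earlier in the job, and the column name and pattern are made up:

    from pyspark.sql import functions as F
    from awsglue.dynamicframe import DynamicFrame

    # Parse the string column with an explicit pattern instead of a bare cast;
    # 'event_time' and the pattern are hypothetical and must match the real data.
    df = datasource.toDF()
    df = df.withColumn("event_time",
                       F.to_timestamp("event_time", "yyyy-MM-dd HH:mm:ss"))

    # Hand the result back to Glue as a DynamicFrame for the S3 target
    converted = DynamicFrame.fromDF(df, glueContext, "converted")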