aws-glue

AWS Glue executor memory limit

元气小坏坏 submitted on 2019-11-30 18:46:06
I found that AWS Glue sets up executor instances with a memory limit of 5 GB (--conf spark.executor.memory=5g), and sometimes, on big datasets, it fails with java.lang.OutOfMemoryError. The same applies to the driver instance (--conf spark.driver.memory=5g). Is there any option to increase this value? The official Glue documentation suggests that Glue doesn't support custom Spark config: "There are also several argument names used by AWS Glue internally that you should never set:
--conf — Internal to AWS Glue. Do not set!
--debug — Internal to AWS Glue. Do not set!
--mode — Internal to AWS Glue. Do not set! …"
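Despite that warning, answers to this question report that supplying a --conf key as a job argument does get passed through to Spark and can raise the memory settings; this is not officially supported. A minimal boto3 sketch of that workaround, where the job name and the particular Spark setting are illustrative placeholders:

import boto3

glue = boto3.client("glue")
# Unsupported workaround: the docs above say --conf is internal to Glue,
# but users report the value is appended to the Spark configuration.
glue.start_job_run(
    JobName="my-etl-job",  # placeholder job name
    Arguments={
        "--conf": "spark.yarn.executor.memoryOverhead=1024",  # illustrative setting/value
    },
)

The same key can instead be put in the job's default arguments so every run picks it up.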

AWS Glue pricing against AWS EMR

拥有回忆 submitted on 2019-11-30 14:08:29
I am doing a pricing comparison of AWS Glue against AWS EMR so as to choose between EMR and Glue. I have considered 6 DPUs (4 vCPUs + 16 GB memory each) with the ETL job running for 10 minutes a day for 30 days. Expected crawler requests are assumed to be 1 million above the free tier, calculated at $1 for the additional 1 million requests. On EMR I have considered m3.xlarge for both EC2 & EMR (priced at $0.266 and $0.070 per hour respectively) with 6 nodes, running for 10 minutes a day for 30 days. Calculating for a month, I see that AWS Glue works out to around $14.64, whereas EMR works out to around …
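For reference, a back-of-the-envelope version of that calculation. The Glue rate of $0.44 per DPU-hour is an assumption (us-east-1 list price); the EMR figures are the ones quoted above. Billing minimums and crawler DPU time are ignored, so the totals are approximate:

glue_dpu_hour = 0.44                            # assumed Glue ETL price per DPU-hour
dpus, minutes_per_run, runs = 6, 10, 30
glue_etl = dpus * (minutes_per_run / 60) * runs * glue_dpu_hour
glue_total = glue_etl + 1.00                    # plus ~$1 for the extra 1M catalog/crawler requests

ec2_rate, emr_rate, nodes = 0.266, 0.070, 6     # m3.xlarge rates from the question
emr_total = nodes * (ec2_rate + emr_rate) * (minutes_per_run / 60) * runs

print("Glue total: about $%.2f" % glue_total)   # ~$14.20, in the same ballpark as the ~$14.64 quoted
print("EMR total:  about $%.2f" % emr_total)    # ~$10.08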

AWS Glue - Truncate destination postgres table prior to insert

回眸只為那壹抹淺笑 submitted on 2019-11-30 13:54:08
I am trying to truncate a Postgres destination table prior to insert and, in general, trying to fire external functions using the connections already created in Glue. Has anyone been able to do so? I've tried the DROP/TRUNCATE scenario, but could not do it with the connections already created in Glue; it did work with a pure-Python PostgreSQL driver, pg8000 (a sketch follows the steps below):
Download the tar of pg8000 from PyPI
Create an empty __init__.py in the root folder
Zip up the contents & upload to S3
Reference the zip file in the Python lib path of the job
Set the DB connection details as job params (make sure to …
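A minimal sketch of that pg8000 approach, assuming the zip has been attached to the job as described; the parameter names (db_host and so on) and the target table are illustrative, not fixed:

import sys
import pg8000
from awsglue.utils import getResolvedOptions

# Connection details passed as job parameters; the key names are whatever you configured.
args = getResolvedOptions(sys.argv, ["db_host", "db_port", "db_name", "db_user", "db_password"])

conn = pg8000.connect(
    host=args["db_host"],
    port=int(args["db_port"]),
    database=args["db_name"],
    user=args["db_user"],
    password=args["db_password"],
)
cur = conn.cursor()
cur.execute("TRUNCATE TABLE my_schema.my_table")  # placeholder table name
conn.commit()
conn.close()

Running this at the top of the Glue script empties the destination before the job's own write step.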

AWS Glue takes a long time to finish

大兔子大兔子 submitted on 2019-11-30 12:31:16
I just ran a very simple job as follows:
glueContext = GlueContext(SparkContext.getOrCreate())
l_table = glueContext.create_dynamic_frame.from_catalog(
    database="gluecatalog", table_name="fctable")
l_table = l_table.drop_fields(['seq','partition_0','partition_1','partition_2','partition_3']).rename_field('tbl_code','table_code')
print "Count: ", l_table.count()
l_table.printSchema()
l_table.select_fields(['trans_time']).toDF().distinct().show()
dfc = l_table.relationalize("table_root", "s3://my-bucket/temp/")
print "Before keys() call "
dfc.keys()
print "After keys() call "
l_table.select …

How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

老子叫甜甜 submitted on 2019-11-30 06:23:32
As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case, different subsets of columns from the table schema)? At the moment, when I run the crawler over this data and then query it in Athena, I get the error 'HIVE_PARTITION_SCHEMA_MISMATCH'. My use case is:
Partitions represent days
Files represent events
Each event is a JSON blob in a single S3 file
An event contains a subset of columns (dependent on the type of event)
The 'schema' of the entire table is the full set of columns for all the …
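One commonly suggested mitigation, sketched here as an assumption rather than a confirmed fix for this exact setup: configure the crawler so new and changed partitions inherit the table-level schema instead of keeping their own, which avoids per-partition schema drift. The crawler name is a placeholder:

import json
import boto3

glue = boto3.client("glue")
glue.update_crawler(
    Name="my-events-crawler",  # placeholder
    Configuration=json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {
            # Partitions reuse the table schema rather than storing their own
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        }
    }),
)

Existing partitions may still need their stored schemas updated (or the table re-crawled) before Athena stops reporting the mismatch.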

AWS Glue: How to handle nested JSON with varying schemas

百般思念 submitted on 2019-11-30 00:45:01
Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum.
Background: The JSON data is from DynamoDB Streams and is deeply nested. The first level of JSON has a consistent set of elements: Keys, NewImage, OldImage, SequenceNumber, ApproximateCreationDateTime, SizeBytes, and EventName. The only variation is that some records do not have a NewImage and some don't have an OldImage. Below this first level, though, the schema varies widely. Ideally, we would like to use Glue to …
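One approach worth sketching, under the assumption that the catalog exposes the first-level fields as typed columns: keep the stable top-level fields as they are and serialize the wildly varying NewImage/OldImage structs back to JSON strings, so a single table schema suffices and Redshift Spectrum can parse the payloads with its JSON functions at query time. Database, table, and output path below are placeholders:

from pyspark.context import SparkContext
from pyspark.sql.functions import col, to_json
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db", table_name="ddb_stream_raw")  # placeholders

df = dyf.toDF()
flattened = df.select(
    to_json(col("Keys")).alias("keys"),
    to_json(col("NewImage")).alias("new_image"),   # null when the record has no NewImage
    to_json(col("OldImage")).alias("old_image"),   # null when the record has no OldImage
    col("SequenceNumber").alias("sequence_number"),
    col("EventName").alias("event_name"),
)
flattened.write.mode("overwrite").parquet("s3://my-bucket/flattened/")  # placeholder path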

How to run glue script from Glue Dev Endpoint

天大地大妈咪最大 submitted on 2019-11-29 08:30:46
I have a Glue script (test.py) written, say, in an editor. I connected to the Glue dev endpoint and copied the script to the endpoint, or I can store it in an S3 bucket. Basically, the Glue endpoint is an EMR cluster; now how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it? I know we can run it from the Glue console, but I am more interested in whether I can run it from the Glue endpoint terminal. You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python), e.g. radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1 …
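A sketch of that workflow; the key file, endpoint DNS, and script name are placeholders:

# from your machine
scp -i dev-endpoint-key.pem test.py glue@ec2-w-x-y-z.compute-1.amazonaws.com:/home/glue/
ssh -i dev-endpoint-key.pem glue@ec2-w-x-y-z.compute-1.amazonaws.com
# on the endpoint: use the gluepython wrapper so the awsglue libraries are on the path
gluepython test.py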

Use AWS Glue Python with NumPy and Pandas Python Packages

一个人想着一个人 submitted on 2019-11-28 11:05:49
What is the easiest way to use packages such as NumPy and Pandas within the new ETL tool on AWS called Glue? I have a completed Python script that utilizes NumPy and Pandas and that I would like to run in AWS Glue. I think the current answer is you cannot. According to the AWS Glue documentation: "Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported." But even when I try to include a library written in plain Python via S3, the Glue job fails because of an HDFS permission problem. If you find a way to solve …
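For later readers: Glue 2.0 and newer added a supported --additional-python-modules job parameter that pip-installs packages such as pandas and NumPy on the workers at job start; the Glue version discussed in this question predates it. A hedged boto3 sketch, with the job name, role, script location, and version pins as placeholders:

import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="pandas-job",                      # placeholder
    Role="my-glue-service-role",            # placeholder IAM role
    GlueVersion="2.0",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/pandas_job.py"},
    DefaultArguments={
        # Installed with pip on the workers before the script runs (Glue 2.0+ only)
        "--additional-python-modules": "pandas==1.1.5,numpy==1.19.5",
    },
)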