aws-glue

How to look for updated rows when using AWS Glue?

我的梦境 submitted on 2019-12-02 04:57:39
Question: I'm trying to use Glue for ETL on data I'm moving from RDS to Redshift. As far as I am aware, Glue bookmarks only look for new rows using the specified primary key and do not track updated rows. However, the data I am working with tends to have rows updated frequently, and I am looking for a possible solution. I'm a bit new to PySpark, so if it is possible to do this in PySpark I'd highly appreciate some guidance or a pointer in the right direction. If there's a possible solution outside of Spark, I'd love to hear it as well. Answer 1: You can use a query to find the updated records by filtering the data…
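
A minimal sketch of that filtering approach, assuming the RDS table carries an "updated_at" column and that the previous run's watermark is stored somewhere (job argument, control table, or S3); the database, table, and column names here are placeholders, not from the question:

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Watermark of the last successful run; in practice load this from a job argument,
# an SSM parameter, or a small control table instead of hard-coding it.
last_run_ts = "2019-12-01 00:00:00"

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_db",      # placeholder catalog database
    table_name="my_table"      # placeholder catalog table
)

# Keep only rows touched since the last run; Spark casts the string to a timestamp.
updated_rows = dyf.toDF().filter(F.col("updated_at") >= F.lit(last_run_ts))

# updated_rows can then be written to Redshift, typically via a staging table + merge
# (see the upsert question further down).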

Using AWS Glue to convert very big csv.gz (30-40 gb each) to parquet

ぃ、小莉子 submitted on 2019-12-02 04:11:24
There are lots of questions like this, but nothing seems to help. I am trying to convert quite large csv.gz files to Parquet and keep getting various errors like 'Command failed with exit code 1' or 'An error occurred while calling o392.pyWriteDynamicFrame. Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-5-241.eu-central-1.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container marked as failed'. In the metrics monitoring I don't see much CPU…
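
For context, gzip is not a splittable format, so each csv.gz file is decompressed by a single executor; with 30-40 GB files that one task often exhausts its container's memory, which matches the ExecutorLostFailure above. A minimal sketch of one common workaround, assuming the input can be split into many smaller .gz files (or decompressed) upstream; the paths are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Many smaller csv.gz files instead of a handful of 30-40 GB ones.
df = (spark.read
      .option("header", "true")
      .csv("s3://my-bucket/input-split/"))      # placeholder input path

# Spread the Parquet write across executors instead of a few huge partitions.
(df.repartition(200)
   .write.mode("overwrite")
   .parquet("s3://my-bucket/output/parquet/"))  # placeholder output path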

How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column

时光总嘲笑我的痴心妄想 submitted on 2019-12-02 04:09:40
Question: How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column? In this case, the source table schema is: Col1, Col2. After the Glue job, the destination schema should be: Col1, Col2, Update_Date (current timestamp). Answer 1: I'm not sure if there's a Glue-native way to do this with the DynamicFrame, but you can easily convert to a Spark DataFrame and then use the withColumn method. You will need to use the lit function to put literal values into a new column, as below: from pyspark.sql.functions import lit glue_df = glueContext.create_dynamic_frame.from_catalog(...) spark_df =…
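
A completed version of that truncated snippet, as a sketch with placeholder catalog names; lit stamps the job's start time as a literal value, while current_timestamp() (noted in the comment) would evaluate at query time instead:

from datetime import datetime

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import lit, current_timestamp

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glue_df = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",    # placeholder
    table_name="my_table"      # placeholder
)

spark_df = glue_df.toDF()
spark_df = spark_df.withColumn("Update_Date", lit(datetime.now().isoformat()))
# Alternative: spark_df.withColumn("Update_Date", current_timestamp())

# Convert back so the usual Glue sinks (e.g. write_dynamic_frame) can be used.
result_dyf = DynamicFrame.fromDF(spark_df, glueContext, "with_update_date")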

Scheduling data extraction from AWS Redshift to S3

让人想犯罪 __ submitted on 2019-12-01 22:10:59
Question: I am trying to build a job for extracting data from Redshift and writing the same data to S3 buckets. So far I have explored AWS Glue, but Glue is not capable of running custom SQL on Redshift. I know we can run UNLOAD commands and store the results in S3 directly. I am looking for a solution that can be parameterised and scheduled in AWS. Answer 1: Consider using AWS Data Pipeline for this. AWS Data Pipeline is an AWS service that allows you to define and schedule regular jobs. These jobs are referred to as pipelines. A pipeline contains the business logic of the work required, for example, extracting data…
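
A sketch of the kind of parameterised UNLOAD such a scheduled job would submit to Redshift (for example from a Data Pipeline activity, or any other scheduler); the bucket, IAM role, table, and date parameter are placeholders:

UNLOAD_TEMPLATE = """
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''{start_date}''')
TO 's3://my-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
DELIMITER ',' GZIP ALLOWOVERWRITE;
"""

# Render the statement for a given run; the result is plain SQL that the scheduled
# job executes against the cluster (note the doubled quotes inside the inner SELECT).
unload_sql = UNLOAD_TEMPLATE.format(start_date="2019-12-01")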

AWS Glue Custom Classifiers Json Path

时光怂恿深爱的人放手 submitted on 2019-12-01 19:28:09
I have a set of JSON data files that look like this: [ {"client":"toys", "filename":"toy1.csv", "file_row_number":1, "secondary_db_index":"4050", "processed_timestamp":1535004075, "processed_datetime":"2018-08-23T06:01:15+0000", "entity_id":"4050", "entity_name":"4050", "is_emailable":false, "is_txtable":false, "is_loadable":false} ] I have created a Glue crawler with the custom classifier JSON path $[*]. Glue returns the correct schema, with the columns correctly identified. However, when I query the data in Athena, all the data lands in the first column and the rest of the…
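
One commonly reported explanation (an assumption here, not stated in the excerpt) is that Athena's JSON SerDe expects one JSON object per line, so a file whose top level is a single array ends up dumped into the first column even though the $[*] classifier gives the crawler the right schema. A small sketch of rewriting such a file as newline-delimited JSON before it lands on S3; the file names are placeholders:

import json

with open("toys.json") as src:            # original file: a single top-level JSON array
    records = json.load(src)

with open("toys.ndjson", "w") as dst:     # one object per line, as Athena expects
    for record in records:
        dst.write(json.dumps(record) + "\n")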

Upsert from AWS Glue to Amazon Redshift

≡放荡痞女 submitted on 2019-12-01 10:55:10
Question: I understand that there is no direct UPSERT query one can perform from Glue to Redshift. Is it possible to implement the staging-table concept within the Glue script itself? My expectation is to create the staging table, merge it with the destination table, and finally delete it. Can this be achieved within the Glue script? Answer 1: Yes, it is totally achievable. All you would need is to import the pg8000 module into your Glue job. pg8000 is a Python library which is used to make…
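
A sketch of that staging-table flow driven from the Glue script with pg8000; the connection details, table names, and key column are placeholders, and the staging table is assumed to have been loaded already (for example by write_dynamic_frame with a preaction, or a COPY from S3):

import pg8000

conn = pg8000.connect(
    host="my-cluster.xxxxxxxx.eu-west-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    database="mydb",
    user="etl_user",
    password="********",
)
cur = conn.cursor()

# Classic Redshift upsert: delete matching rows, insert from staging, drop staging.
cur.execute("DELETE FROM target_table USING staging_table "
            "WHERE target_table.id = staging_table.id;")
cur.execute("INSERT INTO target_table SELECT * FROM staging_table;")
cur.execute("DROP TABLE staging_table;")
conn.commit()
conn.close()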

Overwrite MySQL tables with AWS Glue

随声附和 submitted on 2019-12-01 02:52:43
Question: I have a Lambda process which occasionally polls an API for recent data. This data has unique keys, and I'd like to use Glue to update the table in MySQL. Is there an option to overwrite data using this key (similar to Spark's mode=overwrite)? If not, might I be able to truncate the table in Glue before inserting all the new data? Thanks. Answer 1: I ran into the same issue with Redshift, and the best solution we could come up with was to create a Java class that loads the MySQL driver and issues a…
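
As one alternative to the Java-driver approach (a swap, not what the answer above describes), Spark's own JDBC writer can be used from the Glue job: converting to a DataFrame and writing with mode "overwrite" plus the "truncate" option empties and reloads the table instead of dropping it. A sketch with placeholder connection details; the MySQL JDBC driver must be available to the job:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_catalog_db",   # placeholder
    table_name="recent_data"    # placeholder
)

(dyf.toDF()
    .write
    .format("jdbc")
    .option("url", "jdbc:mysql://my-host:3306/mydb")   # placeholder
    .option("dbtable", "target_table")                 # placeholder
    .option("user", "etl_user")
    .option("password", "********")
    .option("driver", "com.mysql.jdbc.Driver")   # or com.mysql.cj.jdbc.Driver, depending on the connector
    .option("truncate", "true")                  # keep the table, just empty it before loading
    .mode("overwrite")
    .save())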

AWS Glue Crawler Classifies json file as UNKNOWN

允我心安 submitted on 2019-11-30 21:21:33
I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty-printing it), the crawler will classify it without issue as long as the result is under 1 MB. I'm having trouble coming up with a workaround. I tried converting the JSON to BSON and gzipping the JSON file, but it is still classified as UNKNOWN. Has anyone else run into this issue? Is there a better way to do this? I have two JSON files which are 42 MB and 16 MB, partitioned on S3 as path: s3://bucket…

How to turn pip / pypi installed python packages into zip files to be used in AWS Glue

a 夏天 submitted on 2019-11-30 19:43:47
Question: I am working with AWS Glue and PySpark ETL scripts, and I want to use auxiliary libraries such as google_cloud_bigquery as a part of my PySpark scripts. The documentation states this should be possible. A previous Stack Overflow discussion, especially one comment in one of the answers, seems to provide additional proof. However, how to do it is unclear to me. So the goal is to turn the pip-installed packages into one or more zip files, in order to be able to just host the packages on S3 and…
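
A minimal sketch of one way to do this (an assumption, not the documented procedure): install the package into a local directory with pip, zip that directory, upload the archive to S3, and point the Glue job's extra Python files setting at it. Note that Glue PySpark jobs only accept pure-Python packages, so libraries with compiled extensions may still fail at import time:

import shutil
import subprocess

target_dir = "build/site-packages"

# pip install --target drops the package and its dependencies into target_dir.
subprocess.check_call(
    ["pip", "install", "google-cloud-bigquery", "--target", target_dir]
)

# Zip the directory's contents; build/deps.zip is what gets uploaded to S3 and
# referenced via --extra-py-files / the job's "Python library path" setting.
shutil.make_archive("build/deps", "zip", target_dir)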