How to configure Glue bookmarks to work with Scala code?


Have you tried setting the transformationContext value to be the same for both the source and the sink? They are currently set to different values in your last update.

transformationContext = "transformationContext1"

and

transformationContext = "transformationContext2"
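A minimal Scala sketch of that fix, with the database, table, and S3 path as placeholder assumptions (glueContext is assumed to already exist); the point is only that the source and the sink share the same transformationContext value:

import com.amazonaws.services.glue.util.JsonOptions

// Placeholder database, table, and output path.
val source = glueContext.getCatalogSource(
  database = "my_db",
  tableName = "my_table",
  transformationContext = "transformationContext1"
).getDynamicFrame()

val sink = glueContext.getSinkWithFormat(
  connectionType = "s3",
  options = JsonOptions(Map("path" -> "s3://my-bucket/output/")),
  transformationContext = "transformationContext1",  // same value as the source
  format = "parquet"
)
sink.writeDynamicFrame(source)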

I have struggled with this as well using Glue and bookmarks. I'm trying to perform a similar task where I read in JSON files partitioned by year, month, and day, with new files arriving every day. My job runs a transform to pull out a subset of the data and then sinks it into partitioned Parquet files on S3.

I'm using Python so my initial instantiation of the DynamicFrame looked like this:

dyf = glue_context.create_dynamic_frame.from_catalog(database="dev-db", table_name="raw", transformation_ctx="raw")

And a sink to S3 at the end like this:

glue_context.write_dynamic_frame.from_options(
    frame=select_out,
    connection_type='s3',
    connection_options={'path': output_dir, 'partitionKeys': ['year', 'month', 'day']},
    format='parquet',
    transformation_ctx="dev-transactions"
)

Initially I ran the job and the Parquet was generated correctly with bookmarks enabled. I then added a new day of data, updated the partitions on the input table and re-ran. The second job would fail with errors like this:

pyspark.sql.utils.AnalysisException: u"cannot resolve 'year' given input columns: [];;\n'Project ['year, 'month, 'day, 'data']

Changing the transformation_ctx to be the same (dev-transactions in my case) enabled the process to work correctly with only the incremental partitions being processed and Parquet generated for the new partitions.

The documentation is very sparse regarding Bookmarks in general and how the transformation context variable is used.

The Python docs (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html) just say:

transformation_ctx – The transformation context to use (optional).

The Scala docs say (https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-gluecontext.html):

transformationContext — Transformation context associated with the sink to be used by job bookmarks. Set to empty by default.

As best I can observe (since the docs do a poor job of explaining it), the transformation context is used to form a linkage between which source and sink data have been processed, and having different contexts defined prevents bookmarks from working as expected.

It looks like the second time the job runs, no new data is found for your catalog source:

val input = glueContext.getCatalogSource(...).getDynamicFrame()
input.count
// Returns 0: your DynamicFrame has no schema associated with it,
// hence the `Partition column my_date not found in schema StructType()` error

I'd suggest checking the size of your DynamicFrame, or whether your partition field exists in its schema (input.schema.containsField("my_field")), before attempting to map/write it, as in the sketch below. At that point you can decide whether or not to commit the job.
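A hedged Scala sketch of that guard, assuming input is the DynamicFrame read from the catalog source above, sink is an already-configured DataSink, and "my_field" is a placeholder partition field:

import com.amazonaws.services.glue.util.Job

// Only transform/write when the bookmark-filtered source actually returned rows
// and the expected partition field is present in the schema.
if (input.count > 0 && input.schema.containsField("my_field")) {
  sink.writeDynamicFrame(input)
}

// Job.commit() records the bookmark state for whatever was read in this run.
Job.commit()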

Also, if you're certain new data is arriving in that catalog on new partitions, consider running the Crawler to pick up those new partitions, or creating them through the API if you don't expect any schema changes.
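If you want to trigger the crawler programmatically instead of from the console, a small sketch using the AWS SDK for Java v2 from Scala (the crawler name is a placeholder, and the Glue SDK module is assumed to be on the classpath):

import software.amazon.awssdk.services.glue.GlueClient
import software.amazon.awssdk.services.glue.model.StartCrawlerRequest

// Kick off the crawler that maintains the source table's partitions.
val glue = GlueClient.create()
glue.startCrawler(StartCrawlerRequest.builder().name("my-source-crawler").build())
glue.close()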

Hope this helps.

Job bookmarks use the transformation context to key the state for a given ETL operation (primarily the source). Currently, having one on the sink does not have any impact.

One of the reasons jobs fail when job bookmarks are enabled is that they only process incremental data (new files). If there is no new data, the script behaves as it would with an empty input, which can surface as a Spark AnalysisException, for example.

So you should not use the same transformation context across different ETL operators.
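Putting that together, a sketch of a full Scala job skeleton with a distinct transformation context per operation; the object name, database, table, and S3 path are placeholders, and Job.init/Job.commit are what actually persist the bookmark state between runs:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object BookmarkedJob {
  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(new SparkContext())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)

    // Job.init/Job.commit persist bookmark state between runs.
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val source = glueContext.getCatalogSource(
      database = "my_db",                // placeholder
      tableName = "my_table",            // placeholder
      transformationContext = "source0"  // unique to this source
    ).getDynamicFrame()

    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions(Map("path" -> "s3://my-bucket/out/")),  // placeholder path
      transformationContext = "sink0",   // unique to this sink
      format = "parquet"
    ).writeDynamicFrame(source)

    Job.commit()
  }
}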

For your test, after the first run copy new data to your source location and run the job again; only the new data should be processed.
