Question
Consider the following Scala code:
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters.mapAsJavaMapConverter

object MyGlueJob {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = SparkContext.getOrCreate()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val input = glueContext
      .getCatalogSource(database = "my_data_base", tableName = "my_json_gz_partition_table")
      .getDynamicFrame()

    val processed = input.applyMapping(
      Seq(
        ("id", "string", "id", "string"),
        ("my_date", "string", "my_date", "string")
      ))

    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions(Map("path" -> "s3://my_path", "partitionKeys" -> List("my_date"))),
      format = "orc", transformationContext = ""
    ).writeDynamicFrame(processed)

    Job.commit
  }
}
The input is JSON files with gzip compression, partitioned by a date column. Everything works: the data is read as JSON and written as ORC.
But when I run the job again with the same data, it reads everything again and writes duplicated data. Bookmarks are enabled for this job, and the methods Job.init and Job.commit are invoked. What is wrong?
UPDATED
I have added a transformationContext parameter to getCatalogSource and getSinkWithFormat:
val input = glueContext
  .getCatalogSource(database = "my_data_base", tableName = "my_json_gz_partition_table", transformationContext = "transformationContext1")
  .getDynamicFrame()
and:
glueContext.getSinkWithFormat(
  connectionType = "s3",
  options = JsonOptions(Map("path" -> "s3://my_path", "partitionKeys" -> List("my_date"))),
  format = "orc", transformationContext = "transformationContext2"
).writeDynamicFrame(processed)
Now magic "works" in that way:
- First run - ok
- Second run (with same data or same data and new one) - it fails with error (later on)
Again the error happens after second (and subsequent) runs.
The message Skipping Partition {"my_date": "2017-10-10"} also appears in the logs.
ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Partition column my_date not found in schema StructType(); org.apache.spark.sql.AnalysisException: Partition column my_date not found in schema StructType();
at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$11.apply(PartitioningUtils.scala:439)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$11.apply(PartitioningUtils.scala:439)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:438)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:437)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:437)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:420)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:443)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at com.amazonaws.services.glue.SparkSQLDataSink.writeDynamicFrame(DataSink.scala:123)
at MobileArcToRaw$.main(script_2018-01-18-08-14-38.scala:99)
What is really going on with Glue bookmarks?
Answer 1:
Have you tried setting the transformationContext value to be the same for both the source and the sink? They are currently set to different values in your last update: transformationContext = "transformationContext1" and transformationContext = "transformationContext2".
I have struggled with this as well using Glue and bookmarks. I'm trying to perform a similar task where I read in partitioned JSON files that are partitioned by year, month and day with new files arriving every day. My job runs a transform to pull out a subset of the data and then sink into partitioned Parquet files on S3.
I'm using Python, so my initial instantiation of the DynamicFrame looked like this:
dyf = glue_context.create_dynamic_frame.from_catalog(database="dev-db", table_name="raw", transformation_ctx="raw")
And a sink to S3 at the end like this:
glue_context.write_dynamic_frame.from_options(
    frame=select_out,
    connection_type='s3',
    connection_options={'path': output_dir, 'partitionKeys': ['year', 'month', 'day']},
    format='parquet',
    transformation_ctx="dev-transactions"
)
Initially I ran the job and the Parquet was generated correctly with bookmarks enabled. I then added a new day of data, updated the partitions on the input table and re-ran. The second job would fail with errors like this:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'year' given input columns: [];;\n'Project ['year, 'month, 'day, 'data']
Changing the transformation_ctx to be the same (dev-transactions in my case) enabled the process to work correctly, with only the incremental partitions being processed and Parquet generated for the new partitions.
The documentation is very sparse regarding Bookmarks in general and how the transformation context variable is used.
The Python docs (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html) just say:
transformation_ctx – The transformation context to use (optional).
The Scala docs say (https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-gluecontext.html):
transformationContext — Transformation context associated with the sink to be used by job bookmarks. Set to empty by default.
As best I can observe (since the docs do a poor job of explaining it), the transformation context is used to link which source and sink data have already been processed, and having different contexts defined prevents bookmarks from working as expected.
Answer 2:
It looks like the second time the job runs, no new data is found for your catalog table:
val input = glueContext.getCatalogSource(...)
input.count
// Returns 0: the dynamic frame has no schema associated with it,
// hence the `Partition column my_date not found in schema StructType()` error
I'd suggest checking the size of your DynamicFrame, or whether your partition field exists within its schema (input.schema.containsField("my_field")), before attempting to map/write it. At that point you could either commit the job or not.
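A rough sketch of that guard in the question's job, reusing its names; the containsField check comes straight from this answer, while the context name "my_ctx" and the decision to always commit are illustrative assumptions:

// Sketch: only map/write when the bookmarked read actually returned data
// and the partition column is present in the inferred schema.
val input = glueContext
  .getCatalogSource(
    database = "my_data_base",
    tableName = "my_json_gz_partition_table",
    transformationContext = "my_ctx")
  .getDynamicFrame()

if (input.count > 0 && input.schema.containsField("my_date")) {
  val processed = input.applyMapping(Seq(
    ("id", "string", "id", "string"),
    ("my_date", "string", "my_date", "string")))

  glueContext.getSinkWithFormat(
    connectionType = "s3",
    options = JsonOptions(Map("path" -> "s3://my_path", "partitionKeys" -> List("my_date"))),
    format = "orc",
    transformationContext = "my_ctx"
  ).writeDynamicFrame(processed)
}

// Commit (or not) depending on whether you want the bookmark to advance for this run.
Job.commit()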
Also, if you're certain new data is arriving in that catalog table under new partitions, you may consider running the crawler to pick up those new partitions, or creating them through the API if you don't expect any schema changes.
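For example, a sketch of triggering a crawler from code using the AWS SDK for Java v1 (assumed to be on the classpath); the crawler name "my_crawler" is a placeholder and the crawler is assumed to already be configured over the table's S3 location:

import com.amazonaws.services.glue.AWSGlueClientBuilder
import com.amazonaws.services.glue.model.{GetCrawlerRequest, StartCrawlerRequest}

object CrawlerSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder crawler name, assumed to be configured over the
    // S3 location backing my_json_gz_partition_table.
    val crawlerName = "my_crawler"
    val glue = AWSGlueClientBuilder.defaultClient()

    // Kick off the crawler so new date partitions are registered in the
    // Data Catalog before the ETL job reads from it.
    glue.startCrawler(new StartCrawlerRequest().withName(crawlerName))

    // Poll until the crawler is no longer running (simplified, no error handling).
    var state = ""
    do {
      Thread.sleep(30000)
      state = glue.getCrawler(new GetCrawlerRequest().withName(crawlerName))
        .getCrawler.getState
    } while (state != "READY")
  }
}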
Hope this helps.
Answer 3:
Job bookmarks use the transformation context to key the state for a given ETL operation (the source, primarily). Currently, having a transformation context on the sink does not have any impact.
One reason jobs fail when job bookmarks are enabled is that they only process incremental data (new files); if there is no new data, the script behaves as it would with no data at all, which can surface as a Spark analysis exception, for example.
So you should not use the same transformation context across different ETL operators.
For your test, after the first run, copy new data to your source location and run the job again; only the new data should be processed.
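To illustrate the point about distinct contexts per operator, a sketch with two bookmarked sources, each keyed by its own transformation context (the table and context names here are placeholders, not from the question):

// Each bookmarked source keeps its own transformation context,
// so Glue tracks processed files for each table independently.
val orders = glueContext
  .getCatalogSource(
    database = "my_data_base",
    tableName = "orders_table",
    transformationContext = "source_orders")
  .getDynamicFrame()

val customers = glueContext
  .getCatalogSource(
    database = "my_data_base",
    tableName = "customers_table",
    transformationContext = "source_customers")
  .getDynamicFrame()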
Source: https://stackoverflow.com/questions/48314601/how-configure-glue-bookmars-to-work-with-scala-code