Assigning to GenericRecord the timestamp from inner object

Submitted by 删除回忆录丶 on 2019-12-11 13:36:32

Question


Processing streaming events and writing them to files in hourly buckets is a challenge because of windowing: some events from the incoming hour can end up in previous buckets, and so on.
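To make the bucket-boundary problem concrete, here is a minimal sketch in plain Java (java.time only, no Beam) of how an event-time timestamp maps to an hourly bucket. The class name `HourlyBucket`, the method `bucketStart`, and the processing time used below are hypothetical and chosen only for illustration; this is not how Beam implements FixedWindows internally.

```java
import java.time.Duration;
import java.time.Instant;

public class HourlyBucket {
    // Start of the fixed-size bucket that contains the given time,
    // with buckets aligned to the Unix epoch.
    public static Instant bucketStart(Instant time, Duration size) {
        long sizeMillis = size.toMillis();
        long millis = time.toEpochMilli();
        // floorMod keeps the result non-negative even for pre-epoch times.
        return Instant.ofEpochMilli(millis - Math.floorMod(millis, sizeMillis));
    }

    public static void main(String[] args) {
        Duration hour = Duration.ofHours(1);
        // Event created just before 19:00 but (hypothetically) processed just after.
        Instant eventTime = Instant.parse("2019-06-12T18:59:58.609Z");
        Instant processingTime = Instant.parse("2019-06-12T19:00:01.000Z");
        System.out.println("bucket by event time:      " + bucketStart(eventTime, hour));
        System.out.println("bucket by processing time: " + bucketStart(processingTime, hour));
    }
}
```

With these sample values the event-time bucket starts at 18:00 while the processing-time bucket starts at 19:00, which is the mismatch described above.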

I've been digging into Apache Beam and its triggers, but I'm struggling to get triggering to work with the element's own timestamp, as follows...

Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
                    .triggering(AfterProcessingTime
                     .pastFirstElementInPane()
                     .plusDelayOf(Duration.standardSeconds(1)))
                    .withAllowedLateness(Duration.ZERO)
                    .discardingFiredPanes())

This is what I've been doing so far: firing 1-minute windows regardless of the elements' timestamps. However, I would like to use the timestamp contained in the object, so that each window fires only for the elements that fall within it.

Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
                .triggering(AfterWatermark
                    .pastEndOfWindow())
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes())

The objects I'm dealing with have a timestamp field; however, it is a long field and not an Instant.

"{ \"name\": \"timestamp\", \"type\": \"long\", \"logicalType\": \"timestamp-micros\" },"

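For reference, Avro's timestamp-micros logical type counts microseconds since the Unix epoch, whereas the Joda-Time `Instant(long)` constructor used by Beam expects milliseconds. The sketch below shows the conversion in plain Java using java.time rather than Joda-Time, so it runs without extra dependencies. The class and method names are hypothetical. Whether the field really holds microseconds is an assumption based on the schema alone; the timestamps in the error output further down parse as 2019 dates, which suggests the values may already be milliseconds in practice.

```java
import java.time.Instant;

public class TimestampMicros {
    // Convert microseconds since the epoch to a java.time.Instant.
    public static Instant fromMicros(long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        long microOfSecond = Math.floorMod(micros, 1_000_000L);
        // The second argument is a nanosecond adjustment.
        return Instant.ofEpochSecond(seconds, microOfSecond * 1_000L);
    }

    // Convert microseconds to milliseconds since the epoch (rounding down),
    // the unit a millisecond-based Instant constructor would expect.
    public static long toEpochMillis(long micros) {
        return Math.floorDiv(micros, 1_000L);
    }

    public static void main(String[] args) {
        long micros = 1_560_365_998_609_000L; // sample value, assumed to be microseconds
        System.out.println(fromMicros(micros));
        System.out.println(toEpochMillis(micros));
    }
}
```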
With that long field in my POJO class, nothing is triggered. If I swap it for an Instant and recreate the object accordingly, the following error is thrown whenever a PubSub message is read.

Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to java.lang.Long

I've also been thinking about creating a kind of wrapper class around GenericRecord that contains a timestamp, but I would need to use just the inner GenericRecord once it's ready to be written to .parquet with FileIO.
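A minimal sketch of that wrapper idea in plain Java might look as follows. `Stamped` is a hypothetical name, and a `Map` stands in for the GenericRecord so the example runs without Avro on the classpath. Beam also ships `org.apache.beam.sdk.values.TimestampedValue`, which pairs a value with a timestamp in a similar way; whether it fits this pipeline is not something the sketch tries to settle.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.Map;
import java.util.Objects;

// Hypothetical wrapper: carries a payload together with its event timestamp.
public final class Stamped<T> {
    private final T payload;
    private final Instant timestamp;

    private Stamped(T payload, Instant timestamp) {
        this.payload = Objects.requireNonNull(payload, "payload");
        this.timestamp = Objects.requireNonNull(timestamp, "timestamp");
    }

    public static <T> Stamped<T> of(T payload, Instant timestamp) {
        return new Stamped<>(payload, timestamp);
    }

    public T payload() { return payload; }

    public Instant timestamp() { return timestamp; }

    public static void main(String[] args) {
        // Stand-in for a GenericRecord with a long "timestamp" field (assumed millis here).
        Map<String, Object> record = Collections.singletonMap("timestamp", 1_560_365_998_609L);
        long millis = (Long) record.get("timestamp");
        Stamped<Map<String, Object>> stamped = Stamped.of(record, Instant.ofEpochMilli(millis));
        // Windowing would look at timestamp(); the writer only needs payload().
        System.out.println(stamped.timestamp() + " -> " + stamped.payload());
    }
}
```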

What other ways are there to use watermark triggers?

EDIT: After @Anton's comments, I've tried the following.

.apply("Apply timestamps", WithTimestamps.of(
            (SerializableFunction<GenericRecord, Instant>) item -> new Instant(Long.valueOf(item.get("timestamp").toString())))
        .withAllowedTimestampSkew(Duration.standardSeconds(30)))

Even though it has been deprecated, this seems to pass through the pipeline, but nothing is written (the elements are still being discarded before the write for some reason, perhaps by the trigger shown above?).

I also tried the other suggested approach, using outputWithTimestamp, but because of the delay it prints the following error...

Caused by: java.lang.IllegalArgumentException: Cannot output with timestamp 2019-06-12T18:59:58.609Z. Output timestamps must be no earlier than the timestamp of the current input (2019-06-12T18:59:59.848Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
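The rule stated in that message can be reproduced in a few lines of plain Java. This is only a sketch of the comparison as the error text describes it, not Beam's actual implementation; the class and method names are hypothetical, and the two timestamps are taken from the message itself.

```java
import java.time.Duration;
import java.time.Instant;

public class SkewCheck {
    // An output timestamp is acceptable when it is no earlier than
    // the input timestamp minus the allowed skew.
    public static boolean isAllowed(Instant input, Instant output, Duration allowedSkew) {
        return !output.isBefore(input.minus(allowedSkew));
    }

    public static void main(String[] args) {
        Instant input = Instant.parse("2019-06-12T18:59:59.848Z");  // timestamp of the current input
        Instant output = Instant.parse("2019-06-12T18:59:58.609Z"); // timestamp taken from the record
        System.out.println("behind by: " + Duration.between(output, input).toMillis() + " ms");
        System.out.println("skew 0:    " + isAllowed(input, output, Duration.ZERO));
        System.out.println("skew 30 s: " + isAllowed(input, output, Duration.ofSeconds(30)));
    }
}
```

Here the record's timestamp is 1239 ms behind the input timestamp, so the check fails with a skew of zero and passes with the 30-second skew used in the WithTimestamps attempt above.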

Source: https://stackoverflow.com/questions/56565836/assigning-to-genericrecord-the-timestamp-from-inner-object
