Question
My Apache Beam pipeline (Python SDK with DirectRunner, for testing purposes) reads from a Pub/Sub topic.
The message and attributes published are as follows:
message: [{"col1": "test column 1", "col2": "test column 1"}]
attributes: {
    'event_time_v1': str(time.time()),
    'record_id': 'row-1',
}
I'm using the function beam.io.gcp.pubsub.ReadFromPubSub. The code and docs mention the id_label and timestamp_attribute arguments (I believe these are very new additions; they were updated only 13 days ago).
- When I use id_label to assign each element a unique ID for deduplication, I get the following error:
NotImplementedError: DirectRunner: id_label is not supported for PubSub reads
Why is that? Am I correct in understanding that part of the implementation is still missing, or am I missing something here?
- When I use timestamp_attribute = 'event_time_v1' to assign my own timestamp to each element (the client-side event time passed in the message attribute event_time_v1), I notice that the timestamp actually assigned to the element is still the message publish time.
Why is that? I expected it to be the time passed in event_time_v1.
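I'm using the following DoFn to print each element's timestamp:
    import apache_beam as beam

    class PrintFn(beam.DoFn):
        # TimestampParam asks Beam to pass in the element's assigned timestamp.
        def process(self, element, timestamp=beam.DoFn.TimestampParam):
            print(element, timestamp)
            return [element]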
Thanks a lot in advance for any explanation.
Answer 1:
I ran into the same problem today. There is an open Jira issue about id_label and timestamp_attribute being unavailable in the direct runner (and, I assume from reading it, in any non-Dataflow runner). I have successfully used id_label when specifying DataflowRunner as the runner (with some other issues, but that's by the by).
The Jira issue is below:
https://issues.apache.org/jira/browse/BEAM-4275?jql=text%20~%20%22python%20id_label%22
So, at the moment, it would appear this is not yet possible to do using the direct runner.
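In the meantime, one possible workaround for the timestamp part when testing locally is to read the messages together with their attributes and re-stamp each element yourself. The following is only a minimal sketch, not something confirmed by the Jira issue. It assumes that the attribute holds seconds since the Unix epoch as a string (as produced by str(time.time()) in the question) and that the input is a PCollection of PubsubMessage objects, as you get from ReadFromPubSub with with_attributes=True. The helper names event_time_seconds and restamp are made up for illustration.

```python
def event_time_seconds(attributes, key="event_time_v1"):
    """Return the event time in seconds since the Unix epoch.

    Assumption: the attribute value is str(time.time()), as in the question.
    """
    return float(attributes[key])


def restamp(messages, key="event_time_v1"):
    """Re-stamp a PCollection of PubsubMessage with the attribute's time."""
    # Imported lazily so the parsing helper above can be used without Beam.
    import apache_beam as beam
    from apache_beam.transforms.window import TimestampedValue

    # Emitting a TimestampedValue replaces the element's timestamp
    # (here, the publish time) with the client-side event time.
    return messages | "AttachEventTime" >> beam.Map(
        lambda msg: TimestampedValue(msg, event_time_seconds(msg.attributes, key))
    )
```

It could be wired in roughly as `restamp(p | beam.io.ReadFromPubSub(topic=TOPIC, with_attributes=True))`, where TOPIC is a placeholder for your own topic path. Note that this only changes the element timestamps; the source's watermark is presumably still driven by publish time, so re-stamped elements may be treated as late by downstream windowing. It does nothing for the id_label deduplication.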
Source: https://stackoverflow.com/questions/53036548/dataflow-py-2-x-sdk-readfrompubsub-id-label-timestamp-attribute-behaving