Dataflow (Python 2.x SDK) ReadFromPubSub: id_label and timestamp_attribute behaving unexpectedly

旧时模样 posted on 2019-12-08 12:30:28

Question


My Apache Beam pipeline (using the Python SDK with DirectRunner for testing purposes) reads from a Pub/Sub topic.

The message & attributes published are as follows:

message: [{"col1": "test column 1", "col2": "test column 1"}]
attributes: {
  'event_time_v1': str(time.time()),
  'record_id': 'row-1',
}

I'm using beam.io.gcp.pubsub.ReadFromPubSub. Its code and documentation mention the id_label and timestamp_attribute arguments (I believe these are very recent additions, updated only 13 days ago).

  1. When I use id_label to assign each element a unique ID for deduplication, I get the following error:

NotImplementedError: DirectRunner: id_label is not supported for PubSub reads

Why is that? Am I right that part of the implementation is still missing, or am I missing something here?

  2. When I use timestamp_attribute='event_time_v1' to assign my own timestamp to each element (the client-side event time passed in the message attribute event_time_v1), the timestamp actually assigned to the element is still the message publish time.

Why is that? I expected it to be the time passed in event_time_v1.
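A side note on the attribute value itself, separate from the runner question: the ReadFromPubSub documentation describes two accepted formats for timestamp_attribute, either a number of milliseconds since the Unix epoch or an RFC 3339 string in UTC (for example 2015-10-29T23:41:41.123Z). str(time.time()) produces fractional seconds, which matches neither. Whether that plays any part in the behaviour described here is not established in this thread. The helper names below (epoch_millis, rfc3339_utc) are made up for illustration; this is only a minimal sketch of producing either format with the standard library.

```python
import time


def epoch_millis(seconds):
    """Format a time in seconds as whole milliseconds since the Unix epoch."""
    return str(int(seconds * 1000))


def rfc3339_utc(seconds):
    """Format a non-negative time in seconds as an RFC 3339 UTC string."""
    whole = int(seconds)
    millis = int((seconds - whole) * 1000)
    return time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(whole)) + ".%03dZ" % millis


# Attributes shaped like the ones in the question, with the event time
# written in milliseconds instead of fractional seconds.
attributes = {
    "event_time_v1": epoch_millis(time.time()),
    "record_id": "row-1",
}
```

Either helper could supply the event_time_v1 value at publish time; the rest of the message stays the same.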

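I'm using the following DoFn to print each element's timestamp:

import apache_beam as beam

class PrintFn(beam.DoFn):
    # TimestampParam asks Beam to pass in the element's event timestamp.
    def process(self, element, timestamp=beam.DoFn.TimestampParam):
        print(element, timestamp)
        return [element]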

Thanks a lot in advance for any explanation.


Answer 1:


I ran into the same problem today. There is an open Jira issue about id_label and timestamp_attribute being unavailable in the direct runner (and, I assume from reading it, in any non-Dataflow runner). I have been able to use id_label successfully when specifying DataflowRunner as the runner (with some other issues, but that is beside the point).

The Jira issue is below:

https://issues.apache.org/jira/browse/BEAM-4275?jql=text%20~%20%22python%20id_label%22

So, at the moment, it appears this is not yet possible with the direct runner.
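For local testing in the meantime, one possible workaround (not part of the original answer) is to read with with_attributes=True, so each element is a PubsubMessage carrying its attributes, and then re-stamp the element yourself with TimestampedValue. The sketch below assumes the attribute holds seconds since the epoch as text, which is what str(time.time()) in the question produces. The helper names event_time_seconds and with_event_timestamps are hypothetical.

```python
def event_time_seconds(attributes, key="event_time_v1"):
    """Parse the event time, in seconds since the epoch, from message attributes.

    Assumes the value was published as str(time.time()), as in the question.
    """
    return float(attributes[key])


def with_event_timestamps(messages, key="event_time_v1"):
    """Re-stamp PubsubMessage elements with the time carried in an attribute.

    apache_beam is imported inside the function so that the parsing helper
    above can be used and tested without Beam installed.
    """
    import apache_beam as beam
    from apache_beam.transforms.window import TimestampedValue

    return messages | "AssignEventTime" >> beam.Map(
        lambda msg: TimestampedValue(msg, event_time_seconds(msg.attributes, key)))


# Intended use inside a pipeline (topic name is a placeholder):
#   messages = p | beam.io.ReadFromPubSub(topic=TOPIC, with_attributes=True)
#   stamped = with_event_timestamps(messages)
```

Note that this only changes the timestamps of the elements downstream of the read; it does not change how the source itself tracks time, and it does nothing about deduplication on record_id.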



Source: https://stackoverflow.com/questions/53036548/dataflow-py-2-x-sdk-readfrompubsub-id-label-timestamp-attribute-behaving
