apache-beam

Consuming unbounded data in windows with default trigger

Submitted by 落花浮王杯 on 2019-12-02 04:10:57
I have a Pub/Sub topic + subscription and want to consume and aggregate the unbounded data from the subscription in a Dataflow job. I use a fixed window and write the aggregates to BigQuery. Reading and writing (without windowing and aggregation) work fine. But when I pipe the data into a fixed window (to count the elements in each window), the window is never triggered, and thus the aggregates are not written. Here is my word publisher (it uses kinglear.txt from the examples as input file):

    public static class AddCurrentTimestampFn extends DoFn<String, String> {
        @ProcessElement
        public void …
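For reference, a hedged sketch (not the poster's code; the subscription name and window size are assumptions, and p stands for the Pipeline) of counting elements per fixed window on an unbounded Pub/Sub source. The trigger written out below is the default one, firing when the watermark passes the end of the window, shown explicitly together with allowed lateness:

    // Read from the subscription so Pub/Sub timestamps drive the watermark.
    PCollection<String> words = p.apply("ReadWords",
        PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/my-sub"));

    // Count elements per 1-minute fixed window. withoutDefaults() is needed because
    // a global combine over non-global windows cannot produce a default value.
    PCollection<Long> countsPerWindow = words
        .apply("Window", Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply("CountPerWindow", Combine.globally(Count.<String>combineFn()).withoutDefaults());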

How to write to a file name defined at runtime?

Submitted by 强颜欢笑 on 2019-12-02 02:37:30
I want to write to a gs:// file, but I don't know the file name at compile time. Its name is based on behavior that is defined at runtime. How can I proceed?

If you're using Beam Java, you can use FileIO.writeDynamic() for this (starting with Beam 2.3, which is currently in the process of being released, but you can already use it via the version 2.3.0-SNAPSHOT), or the older DynamicDestinations API (available in Beam 2.2). Example of using FileIO.writeDynamic() to write a PCollection of bank transactions to different paths on GCS depending on the transaction's type:

    PCollection<BankTransaction> …
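The example code above is cut off. As a hedged sketch of how FileIO.writeDynamic() is typically wired up (BankTransaction, getType(), toCsvLine(), and the bucket path are placeholders, not taken from the answer):

    // Route each transaction to a GCS path derived from its type.
    // `transactions` stands in for the PCollection<BankTransaction> above.
    transactions.apply(FileIO.<String, BankTransaction>writeDynamic()
        .by(tx -> tx.getType())                        // destination key per element
        .withDestinationCoder(StringUtf8Coder.of())    // coder for the destination key
        .via(Contextful.fn(tx -> tx.toCsvLine()), TextIO.sink())
        .to("gs://my-bucket/transactions/")            // base output directory (assumption)
        .withNaming(type -> FileIO.Write.defaultNaming(type + "-transactions", ".csv")));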

Apache Beam Google Datastore ReadFromDatastore entity protobuf

Submitted by 本小妞迷上赌 on 2019-12-02 00:13:21
I am trying to use Apache Beam's Google Datastore API to ReadFromDatastore:

    p = beam.Pipeline(options=options)
    (p
     | 'Read from Datastore' >> ReadFromDatastore(gcloud_options.project, query)
     | 'reformat' >> beam.Map(reformat)
     | 'Write To Datastore' >> WriteToDatastore(gcloud_options.project))

The object that gets passed to my reformat function is of type google.cloud.proto.datastore.v1.entity_pb2.Entity. It is in protobuf format, which is hard to modify or read. I think I can convert an entity_pb2.Entity to a dict with

    entity = dict(google.cloud.datastore.helpers._property_tuples(entity_pb))

But for …

Stateful indexing causes ParDo to be run single-threaded on Dataflow Runner

Submitted by 試著忘記壹切 on 2019-12-01 22:56:14
Question: We're generating a sequential index in a ParDo using Beam's Java SDK 2.0.0. Just like the simple stateful index example in Beam's introduction to stateful processing, we use a ValueState<Integer> cell, and our only operation on it is to retrieve the value and increment it when we need the next index:

    Integer statefulIndex = firstNonNull(index.read(), 0);
    index.write(statefulIndex + 1);

When running with Google's Dataflow runner, we noticed on the Dataflow monitoring interface that the wall time …
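For context, a hedged sketch of the kind of stateful DoFn being described (the key and output types are assumptions). State in Beam is partitioned per key and window, so feeding every element through a single key forces the step to run effectively single-threaded:

    static class AssignIndexFn extends DoFn<KV<String, String>, KV<Integer, String>> {

      // One ValueState cell per key and window holding the next index to hand out.
      @StateId("index")
      private final StateSpec<ValueState<Integer>> indexSpec = StateSpecs.value(VarIntCoder.of());

      @ProcessElement
      public void processElement(ProcessContext c,
                                  @StateId("index") ValueState<Integer> index) {
        int current = MoreObjects.firstNonNull(index.read(), 0); // Guava's firstNonNull
        index.write(current + 1);
        c.output(KV.of(current, c.element().getValue()));
      }
    }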

At what stage does Dataflow/Apache Beam ack a pub/sub message?

Submitted by 时间秒杀一切 on 2019-12-01 21:47:53
I have a Dataflow streaming job with a Pub/Sub subscription as an unbounded source. I want to know at what stage Dataflow acks the incoming Pub/Sub message. It appears to me that the message is lost if an exception is thrown during any stage of the Dataflow pipeline. I'd also like to know the best practices for writing a Dataflow pipeline with a Pub/Sub unbounded source so that messages can be retrieved on failure. Thank you!

The Dataflow Streaming Runner acks Pub/Sub messages received by a bundle after the bundle has succeeded and the results of the bundle (outputs, state mutations, etc.) have been …
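The answer excerpt is cut off above. As a hedged illustration of one common practice (not taken from the answer): catch exceptions inside the DoFn and sideline failing elements to a dead-letter output, so the bundle still succeeds and the Pub/Sub message is acked instead of being redelivered:

    // Successful results go to MAIN; elements that throw go to DEAD_LETTER.
    // transform() is a placeholder for the real processing logic.
    final TupleTag<String> MAIN = new TupleTag<String>() {};
    final TupleTag<String> DEAD_LETTER = new TupleTag<String>() {};

    PCollectionTuple results = messages.apply("Process",
        ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            try {
              c.output(transform(c.element()));
            } catch (Exception e) {
              c.output(DEAD_LETTER, c.element()); // write these somewhere for later replay
            }
          }
        }).withOutputTags(MAIN, TupleTagList.of(DEAD_LETTER)));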

BigQuery writeTableRows Always writing to buffer

Submitted by 。_饼干妹妹 on 2019-12-01 14:42:42
We are trying to write to BigQuery using Apache Beam and Avro. The following seems to work OK:

    p.apply("Input", AvroIO.read(DataStructure.class).from("AvroSampleFile.avro"))
     .apply("Transform", ParDo.of(new CustomTransformFunction()))
     .apply("Load", BigQueryIO.writeTableRows().to(table).withSchema(schema));

We then tried to use it in the following manner to get data from Google Pub/Sub:

    p.begin()
     .apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
     .apply("Transform", ParDo.of(new CustomTransformFunction()))
     .apply("Write", BigQueryIO.writeTableRows()
         .to(table) …
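No answer is included in the excerpt. For context, BigQueryIO uses streaming inserts by default for unbounded input, and streamed rows pass through BigQuery's streaming buffer before being committed to the table. A hedged sketch (an assumption, not from this thread) of switching the final "Write" step above to periodic batch loads instead:

    .apply("Write", BigQueryIO.writeTableRows()
        .to(table)
        .withSchema(schema)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withTriggeringFrequency(Duration.standardMinutes(5)) // required for FILE_LOADS on unbounded input
        .withNumFileShards(1)                                 // likewise required on unbounded input
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));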

GroupIntoBatches for non-KV elements

Submitted by 半腔热情 on 2019-12-01 13:00:30
According to the Apache Beam 2.0.0 SDK documentation, GroupIntoBatches works only with KV collections. My dataset contains only values and there's no need to introduce keys. However, to make use of GroupIntoBatches I had to implement "fake" keys with an empty string as the key:

    static class FakeKVFn extends DoFn<String, KV<String, String>> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(KV.of("", c.element()));
      }
    }

So the overall pipeline looks like the following:

    public static void main(String[] args) {
      PipelineOptions options = PipelineOptionsFactory.create();
      …
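For reference, a hedged sketch of the full "fake key" pattern around GroupIntoBatches (the input collection lines and the batch size of 100 are assumptions):

    // Wrap values in KV with a constant key, batch them, then drop the key again.
    PCollection<Iterable<String>> batches = lines
        .apply("AddFakeKey", ParDo.of(new FakeKVFn()))
        .apply("BatchElements", GroupIntoBatches.<String, String>ofSize(100))
        .apply("DropKey", Values.<Iterable<String>>create());

Note that a single constant key also funnels all elements through one key, the same per-key constraint discussed in the stateful-indexing question above.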