apache-beam

Apache Beam: RabbitMqIO watermark doesn't advance

两盒软妹~` submitted on 2019-12-11 06:02:24
Question: I need some help, please. I'm trying to use Apache Beam with the RabbitMqIO source (version 2.11.0) and the AfterWatermark.pastEndOfWindow trigger. It seems the RabbitMqIO watermark doesn't advance and remains the same, and because of this behavior the AfterWatermark trigger never fires. Triggers that don't take the watermark into consideration work fine (e.g. AfterProcessingTime, AfterPane). My code is below, thanks: public class Main { private static final Logger LOGGER =
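The code in the post is cut off above; below is a minimal sketch of the kind of setup being described (the broker URI, queue name, and window size are assumptions), showing a trigger that only fires once the source's watermark passes the end of the window:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.rabbitmq.RabbitMqIO;
import org.apache.beam.sdk.io.rabbitmq.RabbitMqMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class RabbitWatermarkSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("ReadRabbit", RabbitMqIO.read()
            .withUri("amqp://guest:guest@localhost:5672") // assumed broker URI
            .withQueue("myQueue"))                        // assumed queue name
     // AfterWatermark.pastEndOfWindow() fires only when the source's
     // watermark passes the window boundary -- the event the poster
     // reports never happening with RabbitMqIO.
     .apply(Window.<RabbitMqMessage>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes());
    p.run().waitUntilFinish();
  }
}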

How to get the real execution time of a Pipeline and the duration time of start up of the VMs of a Dataflow Job

此生再无相见时 submitted on 2019-12-11 05:59:03
Question: I want to get two durations: the exact start-up time of the virtual machines deployed in Compute Engine, and the real execution time of the pipeline once a Dataflow job ends (which is much less than the elapsed time shown for the job on the Dataflow website). I need to get these durations from Java; getting the values directly from the Google Cloud website would also be fine. Source: https://stackoverflow.com/questions/45122068/how-to-get-the-real-execution-time-of-a
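No answer is included in this dump. As a hedged starting point, the total client-observed duration can be measured around waitUntilFinish; note that this does not separate out VM start-up, which would require querying the Dataflow or Compute Engine APIs:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class TimedRunSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // ... build the pipeline here ...
    long start = System.currentTimeMillis();
    PipelineResult result = pipeline.run();
    result.waitUntilFinish(); // blocks until the Dataflow job terminates
    long elapsedMs = System.currentTimeMillis() - start;
    // This figure includes worker VM start-up as well as actual pipeline
    // execution; splitting the two apart needs per-VM creation timestamps
    // from the Compute Engine API, which this sketch does not cover.
    System.out.println("End-to-end job duration: " + elapsedMs + " ms");
  }
}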

'Timely and stateful' processing possible with Apache Beam Java using Dataflow runner?

丶灬走出姿态 submitted on 2019-12-11 05:36:33
Question: I'm evaluating Apache Beam (Java SDK), specifically with Google Cloud's Dataflow runner, for a somewhat complex state-machine workflow. In particular I want to take advantage of stateful processing and timers as explained in this blog post: https://beam.apache.org/blog/2017/08/28/timely-processing.html Looking at the capability matrix page for Dataflow, it says: Timers: "Dataflow supports timers in non-merging windows". OK, that's fine. Stateful processing: "State is supported for
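For reference, a minimal sketch of the state-and-timers pattern from that blog post (the names and the one-minute timeout are illustrative; the input must be keyed, and on Dataflow the windows must be non-merging):

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Counts elements per key and emits the count after one minute of
// inactivity for that key.
class CountWithTimeout extends DoFn<KV<String, String>, KV<String, Long>> {
  @StateId("key")
  private final StateSpec<ValueState<String>> keySpec = StateSpecs.value(StringUtf8Coder.of());
  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());
  @TimerId("expiry")
  private final TimerSpec expirySpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(ProcessContext c,
                      @StateId("key") ValueState<String> key,
                      @StateId("count") ValueState<Long> count,
                      @TimerId("expiry") Timer expiry) {
    key.write(c.element().getKey()); // remember the key for the timer callback
    Long current = count.read();
    count.write(current == null ? 1L : current + 1);
    // (Re)arm the timer: fire one minute after the latest element.
    expiry.offset(Duration.standardMinutes(1)).setRelative();
  }

  @OnTimer("expiry")
  public void onExpiry(OnTimerContext c,
                       @StateId("key") ValueState<String> key,
                       @StateId("count") ValueState<Long> count) {
    Long current = count.read();
    if (current != null) {
      c.output(KV.of(key.read(), current));
      count.clear();
    }
  }
}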

What actually manages watermarks in Beam?

点点圈 submitted on 2019-12-11 05:35:55
Question: Beam's big power comes from its advanced windowing capabilities, but they're also a bit confusing. Having seen some oddities in local tests (I use RabbitMQ for an input Source) where messages were not always getting ack'd and fixed windows were not always closing, I started digging around StackOverflow and the Beam code base. It seems there are Source-specific concerns about when exactly watermarks are set: RabbitMQ watermark does not advance: Apache Beam : RabbitMqIO watermark doesn't

Programmatically terminating PubSubIO.readMessages from Subscription after configured time?

China☆狼群 submitted on 2019-12-11 05:13:54
Question: I am looking to schedule a Dataflow job that reads with PubSubIO.readString from a Pub/Sub topic's subscription. How can I have the job terminate after a configured interval? My use case is not to keep the job running through the entire day, so I'm looking to schedule it to start and then stop after a configured interval from within the job. Pipeline .apply(PubsubIO.readMessages().fromSubscription("some-subscription")) Answer 1: From the docs: If you need to stop a running Cloud Dataflow job, you can do so by
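The quoted answer is cut off above. One hedged sketch, run from the launcher process rather than from inside the job itself: block for the configured interval with waitUntilFinish(Duration), then cancel the job if it is still running:

import java.io.IOException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.joda.time.Duration;

public class TimedCancelSketch {
  // maxRuntime is the configured interval after which the job is stopped.
  static void runForAtMost(Pipeline pipeline, Duration maxRuntime) throws IOException {
    PipelineResult result = pipeline.run();
    result.waitUntilFinish(maxRuntime); // returns when the job ends or the timeout elapses
    if (result.getState() == PipelineResult.State.RUNNING) {
      result.cancel(); // asks the runner to stop the still-running job
    }
  }
}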

Apache Beam: Skipping steps in an already-constructed pipeline

旧街凉风 submitted on 2019-12-11 04:47:27
Question: Is there a way to conditionally skip steps in an already-constructed pipeline? Or is pipeline construction designed to be the only way to control which steps are run? Answer 1: Normally, pipeline construction controls which transformations in a pipeline will be executed. You can, however, imagine a single-input, multiple-output ParDo that multiplexes the input PCollection to one of the output PCollections. Then, by choosing which output to pass your data to, you can dynamically control which steps
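A sketch of the multiplexing ParDo the answer describes, assuming an input PCollection<String> named input and a hypothetical needsProcessing predicate:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

final TupleTag<String> processTag = new TupleTag<String>() {};
final TupleTag<String> skipTag = new TupleTag<String>() {};

PCollectionTuple routed = input.apply("Route", ParDo.of(
    new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        if (needsProcessing(c.element())) { // hypothetical predicate
          c.output(c.element());            // main output: run the heavy steps
        } else {
          c.output(skipTag, c.element());   // side output: bypass them
        }
      }
    }).withOutputTags(processTag, TupleTagList.of(skipTag)));

// Apply the expensive transforms only to this branch; the skipTag branch
// flows around them, effectively "skipping" those steps per element.
PCollection<String> toProcess = routed.get(processTag);
PCollection<String> skipped = routed.get(skipTag);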

Execute multiple queries on BigQuery using ApacheBeam

人走茶凉 submitted on 2019-12-11 04:20:00
Question: I have a file on Google Cloud Storage that contains a number of queries (insert/update/delete/select). I need to do two things: 1) execute all the queries, and 2) for the select queries, write the results to a file in GCS. What is the most efficient way to do this in Apache Beam? Thank you. Source: https://stackoverflow.com/questions/45862173/execute-multiple-queries-on-bigquery-using-apachebeam
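This question has no answer in the dump. One possible shape, offered purely as an assumption and not necessarily the most efficient: read the GCS file line by line with TextIO, then execute each statement with the BigQuery client inside a DoFn, emitting SELECT results for a downstream TextIO.write() back to GCS:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import org.apache.beam.sdk.transforms.DoFn;

// Executes one SQL statement per input element via the BigQuery client.
class RunQueryFn extends DoFn<String, String> {
  private transient BigQuery bigquery;

  @Setup
  public void setup() {
    bigquery = BigQueryOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void process(ProcessContext c) throws InterruptedException {
    String sql = c.element();
    // query() runs SELECTs as well as DML; for SELECTs, emit each row so a
    // downstream TextIO.write() can persist the results to GCS.
    bigquery.query(QueryJobConfiguration.newBuilder(sql).build())
        .iterateAll()
        .forEach(row -> c.output(row.toString()));
  }
}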

Apache Beam IllegalArgumentException on Google Dataflow with message `Not expecting a splittable ParDoSingle: should have been overridden`

岁酱吖の submitted on 2019-12-11 03:43:35
Question: I am trying to write a pipeline that periodically checks a Google Storage bucket for new .gz files, which are actually compressed .csv files, and writes those records to a BigQuery table. The following code was working in batch mode before I added the .watchForNewFiles(...) and .withMethod(STREAMING_INSERTS) parts; I expect it to run in streaming mode with those changes. However, I am getting an exception for which I can't find anything related on the web. Here is my code: public static
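The poster's code is truncated; a sketch of the shape being described (the bucket, file pattern, and poll interval are assumptions) is:

import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Continuously watch a GCS pattern for new gzipped CSV files; the
// watchForNewFiles call is what turns the formerly-batch read into an
// unbounded (streaming) one.
PCollection<String> lines = p.apply("ReadCsvGz", TextIO.read()
    .from("gs://my-bucket/input/*.gz")               // assumed pattern
    .withCompression(Compression.GZIP)
    .watchForNewFiles(Duration.standardMinutes(1),   // assumed poll interval
                      Watch.Growth.never()));        // keep watching forever
// ... parse each CSV line into a TableRow, then write with
// BigQueryIO.writeTableRows().withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS).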

Right way to handle one-to-many stages in Dataflow

久未见 submitted on 2019-12-11 03:13:06
Question: I have a (Java) batch pipeline that follows this pattern: (FileIO) (ExtractText > input = 1 file, output = millions of lines of text) (ProcessData) The ProcessData stage contains slow parts (matching data against big whitelists) and needs to be scaled across several workers, which should not be an issue since it only contains DoFns. However, it would seem that my one-to-many stage forces all the outputs to be processed by only one worker (instantiating more workers makes them all idle
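The answer isn't included in this dump, but the symptom matches Dataflow's fusion optimization, where the one-to-many step gets fused with its downstream consumers onto the single worker reading the file. A commonly suggested remedy (an assumption here, not taken from the post) is to break fusion with a reshuffle between the fan-out and the slow stage:

import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

// After the fan-out step has produced millions of lines, a reshuffle
// materializes them and lets Dataflow redistribute the elements, so the
// slow ProcessData DoFn can run on many workers instead of one.
PCollection<String> redistributed = extractedLines
    .apply("BreakFusion", Reshuffle.<String>viaRandomKey());
PCollection<String> processed = redistributed
    .apply("ProcessData", ParDo.of(new ProcessDataFn())); // ProcessDataFn stands in for the poster's slow DoFn (name assumed)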

EOFException related to memory segments during run of Beam pipeline on Flink

爱⌒轻易说出口 submitted on 2019-12-11 03:06:57
Question: I'm trying to run an Apache Beam pipeline on Flink on our test cluster. It has been failing with an EOFException at org.apache.flink.runtime.io.disk.SimpleCollectingOutputView:79 during the encoding of an object through serialisation. I haven't been able to reproduce the error locally yet. You can find the entire job log here; some values have been replaced with fake data. The command used to run the pipeline: bin/flink run \ -m yarn-cluster \ --yarncontainer 1 \ --yarnslots 4 \ -