apache-beam

A simple counting step following a group by key is extremely slow in a Dataflow pipeline

若如初见. Submitted on 2019-12-05 03:55:06
Question: I have a Dataflow pipeline trying to build an index (key-value pairs) and compute some metrics (such as the number of values per key). The input data is about 60 GB total, stored on GCS, and the pipeline has about 126 workers allocated. Per Stackdriver all workers have about 6% CPU utilization. The pipeline seems to make no progress despite having 126 workers, and based on the wall time the bottleneck seems to be a simple counting step that follows a group by. While all other steps have on average
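
A common way to avoid a slow count after a GroupByKey is to replace the group-then-count pattern with a combiner such as Count.perKey(), which the runner can lift into partial, pre-shuffle aggregation instead of materializing every value list per key. A minimal sketch, assuming the input is already a keyed PCollection (the String element types are illustrative, not from the original post):

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Counts values per key without grouping all values first,
    // letting the runner combine partial counts before the shuffle.
    PCollection<KV<String, Long>> valuesPerKey =
        keyValuePairs.apply("CountPerKey", Count.perKey());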

Apache Beam Counter/Metrics not available in Flink WebUI

心已入冬 Submitted on 2019-12-05 01:29:08
I'm using Flink 1.4.1 and Beam 2.3.0, and would like to know whether it is possible to have metrics available in the Flink WebUI (or anywhere at all), as in the Dataflow WebUI. I've used a counter like: import org.apache.beam.sdk.metrics.Counter; import org.apache.beam.sdk.metrics.Metrics; ... Counter elementsRead = Metrics.counter(getClass(), "elements_read"); ... elementsRead.inc(); but I can't find the "elements_read" counts available anywhere (Task Metrics or Accumulators) in the Flink WebUI. I thought this would be straightforward after BEAM-773. Once you have selected a job in your dashboard, you will see the
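
Independently of what the Flink WebUI exposes, Beam metrics can be queried programmatically from the PipelineResult after the run. A minimal sketch, assuming the counter was declared in a class called MyDoFn (that class name is illustrative):

    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.metrics.MetricNameFilter;
    import org.apache.beam.sdk.metrics.MetricQueryResults;
    import org.apache.beam.sdk.metrics.MetricResult;
    import org.apache.beam.sdk.metrics.MetricsFilter;

    PipelineResult result = pipeline.run();
    result.waitUntilFinish();

    // Query the "elements_read" counter by the namespace/name it was declared with.
    MetricQueryResults metrics = result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.named(MyDoFn.class, "elements_read"))
            .build());
    for (MetricResult<Long> counter : metrics.getCounters()) {
      System.out.println(counter.getName() + ": " + counter.getAttempted());
    }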

Stateful processing in Beam - is state shared across window panes?

老子叫甜甜 Submitted on 2019-12-04 16:36:41
Apache Beam has recently introduced state cells, through StateSpec and the @StateId annotation, with partial support in Apache Flink and Google Cloud Dataflow. My question is about state garbage collection, in the case where a stateful DoFn is used on a windowed stream. Typically, state is removed (garbage collected) by the runner when the window expires (i.e. the watermark passes the end of the window). However, consider the case where window panes are triggered early, and the fired panes are discarded: input.apply(Window.<MyElement>into(CalendarWindows.days(1)) .triggering(AfterWatermark
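
For reference, a minimal stateful DoFn of the kind being discussed; state declared this way is scoped to a key and window, so every pane fired for the same window sees the same cell, and the runner clears it when the window expires. This sketch assumes a recent 2.x SDK and a keyed input; the String key, MyElement payload, and counting logic are illustrative:

    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
    import org.apache.beam.sdk.transforms.DoFn.StateId;
    import org.apache.beam.sdk.values.KV;

    class CountPerKeyFn extends DoFn<KV<String, MyElement>, KV<String, Long>> {
      @StateId("count")
      private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

      @ProcessElement
      public void process(ProcessContext c, @StateId("count") ValueState<Long> count) {
        // The same cell is read back on every pane of this key/window until the window expires.
        long next = (count.read() == null ? 0L : count.read()) + 1;
        count.write(next);
        c.output(KV.of(c.element().getKey(), next));
      }
    }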

TextIO. Read multiple files from GCS using pattern {}

佐手、 Submitted on 2019-12-04 16:19:44
I tried using the following: TextIO.Read.from("gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv") That pattern didn't work, as I get java.lang.IllegalStateException: Unable to find any files matching StaticValueProvider{value=gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv} even though those 2 files do exist. I tried the same with a local file using a similar expression: TextIO.Read.from("somefolder/xxx_{2017-06-06,2017-06-06}.csv") and that worked just fine. I would have thought all kinds of globs would be supported for files in GCS, but apparently not. Why is that? Is there a way to accomplish what I'm
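
The IllegalStateException suggests brace expansion like {a,b} simply isn't matched by the GCS filesystem in Beam, even though the local filesystem accepts it. One workaround is to read each concrete file separately and Flatten the results; a minimal sketch using the newer TextIO.read() form (the two file names shown are illustrative):

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    // Read each concrete file and flatten the two PCollections into one.
    PCollection<String> day1 = p.apply("ReadDay1",
        TextIO.read().from("gs://xyz.abc/xxx_2017-06-06.csv"));
    PCollection<String> day2 = p.apply("ReadDay2",
        TextIO.read().from("gs://xyz.abc/xxx_2017-06-07.csv"));
    PCollection<String> all =
        PCollectionList.of(day1).and(day2).apply(Flatten.pCollections());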

Difference between beam.ParDo and beam.Map in the output type?

回眸只為那壹抹淺笑 Submitted on 2019-12-04 15:02:59
I am using Apache Beam to run some data transformations, including data extraction from txt, csv, and other data sources. One thing I noticed is the difference in results when using beam.Map and beam.ParDo. In the following sample I am reading csv data; in the first case I pass it to a DoFn using beam.ParDo, which extracts the first element (the date) and prints it. In the second case, I directly use beam.Map to do the same thing, then print it. class Printer(beam.DoFn): def process(self,data_item): print data_item class DateExtractor(beam.DoFn): def process(self,data_item
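
The question concerns the Python SDK, where beam.Map emits exactly the value the function returns, while a DoFn's process is expected to yield (or return an iterable of) its outputs, so returning a bare value makes the runner iterate over it. For comparison, the Java SDK used elsewhere on this page draws the same line between MapElements (one output per element, the function's return value) and ParDo (zero or more outputs, explicitly emitted). A minimal sketch; the rows collection and the comma-splitting logic are illustrative:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // MapElements: exactly one output per input, the function's return value.
    PCollection<String> dates = rows.apply(
        MapElements.into(TypeDescriptors.strings())
            .via((String row) -> row.split(",")[0]));

    // ParDo: outputs are whatever the DoFn explicitly emits.
    PCollection<String> datesToo = rows.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        c.output(c.element().split(",")[0]);
      }
    }));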

Early results from GroupByKey transform

旧城冷巷雨未停 Submitted on 2019-12-04 14:34:49
How can I get GroupByKey to trigger early results, rather than waiting for all the data to arrive (which in my case takes a pretty long time)? I tried to split my input PCollection into windows with an early trigger, but it just doesn't work. It still waits for all the data to arrive before giving out the results. PCollection<List<String>> input = ... PCollection<KV<Integer,List<String>>> keyedInput = input.apply(ParDo.of(new AddArbitraryKey())) keyedInput.apply(Window.<KV<Integer,List<String>>>into( FixedWindows.of(Duration.standardSeconds(1))) .triggering(Repeatedly.forever(AfterWatermark
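
For comparison, a window configuration that produces speculative panes needs an explicit early-firing clause on the AfterWatermark trigger, plus an allowed lateness and an accumulation mode; note also that on a bounded (batch) input most runners only produce the final pane regardless of the trigger. A minimal sketch under those assumptions (the delays chosen here are illustrative):

    import java.util.List;
    import org.joda.time.Duration;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;

    keyedInput
        .apply(Window.<KV<Integer, List<String>>>into(
                FixedWindows.of(Duration.standardSeconds(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                // Emit a speculative pane shortly after the first element arrives.
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(10))))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes())
        .apply(GroupByKey.create());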

How to create groups of N elements from a PCollection Apache Beam Python

时间秒杀一切 Submitted on 2019-12-04 14:29:02
Question: I am trying to accomplish something like this: Batch PCollection in Beam/Dataflow The answer in the above link is in Java, whereas the language I'm working with is Python, so I need some help getting a similar construction. Specifically I have this: p = beam.Pipeline (options = pipeline_options) lines = p | 'File reading' >> ReadFromText (known_args.input) After this, I need to create another PCollection, but with a list of N rows of "lines", since my use case requires a group of rows. I
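
The linked construction is Java; for reference, the Java SDK's GroupIntoBatches expresses the idea directly on a keyed collection (the Python SDK has a comparable BatchElements utility in apache_beam.transforms.util, worth checking against the SDK version in use). A minimal Java sketch, assuming an arbitrary constant key is acceptable and the runner supports stateful processing; the key and batch size are illustrative:

    import org.apache.beam.sdk.transforms.GroupIntoBatches;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // Attach a constant key, then group the values into batches of at most 100 elements.
    PCollection<KV<String, String>> keyed =
        lines.apply(WithKeys.<String, String>of("batch")
            .withKeyType(TypeDescriptors.strings()));
    PCollection<KV<String, Iterable<String>>> batches =
        keyed.apply(GroupIntoBatches.ofSize(100));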

How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>

十年热恋 Submitted on 2019-12-04 14:06:52
I'm setting up a slowly-changing lookup map in my Apache Beam pipeline. It continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode. But it always hits this exception: org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Duplicate values for mykey Is anything wrong with this code snippet? If I use .discardingFiredPanes() instead, I will lose information in the last emit. pipeline .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1L))) .apply( Window.<Long>into
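
View.asMap requires exactly one value per key in the pane it observes, which appears to be what the accumulating panes violate here. Two hedged variations that avoid the duplicate-key check, depending on intent; lookupKvs stands in for the keyed PCollection feeding the view and is illustrative:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.Latest;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.PCollectionView;

    // Option 1: reduce to the newest value per key before building the map view.
    PCollectionView<Map<String, String>> latestView =
        lookupKvs.apply(Latest.perKey()).apply(View.asMap());

    // Option 2: keep all values per key and pick the newest one at read time.
    PCollectionView<Map<String, Iterable<String>>> multimapView =
        lookupKvs.apply(View.asMultimap());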

Processing Total Ordering of Events By Key using Apache Beam

我是研究僧i Submitted on 2019-12-04 11:26:40
Problem Context: I am trying to generate a total (linear) order of event items per key from a real-time stream, where the order is event time (derived from the event payload). Approach: I had attempted to implement this using streaming as follows: 1) Set up non-overlapping sequential windows, e.g. of 5 minutes duration 2) Establish an allowed lateness - it is fine to discard late events 3) Set accumulation mode to retain all fired panes 4) Use the "AfterWatermark" trigger 5) When handling a triggered pane, only consider the pane if it is the final one 6) Use GroupBy.perKey to ensure all events in
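
A minimal sketch of the windowing described in steps 1-5, with a downstream check that only the final pane of each window is processed. The Event payload type, the String key, and the keyedEvents collection are illustrative, not part of the original post:

    import org.joda.time.Duration;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<KV<String, Iterable<Event>>> grouped = keyedEvents
        .apply(Window.<KV<String, Event>>into(FixedWindows.of(Duration.standardMinutes(5)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.standardMinutes(1))  // later events are dropped
            .accumulatingFiredPanes())
        .apply(GroupByKey.create());

    grouped.apply(ParDo.of(new DoFn<KV<String, Iterable<Event>>, KV<String, Iterable<Event>>>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // Step 5: only act on the final pane of the window; sorting the values by the
        // event-time field of the payload would happen here before emitting.
        if (c.pane().isLast()) {
          c.output(c.element());
        }
      }
    }));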

KafkaIO checkpoint - how to commit offsets to Kafka

戏子无情 Submitted on 2019-12-04 07:24:17
I'm running a job using the Beam KafkaIO source in Google Dataflow and cannot find an easy way to persist offsets across job restarts (the job update option is not enough, I need to restart the job). Comparing Beam's KafkaIO against PubSubIO (or, to be precise, comparing PubsubCheckpoint with KafkaCheckpointMark), I can see that checkpoint persistence is not implemented in KafkaIO (the KafkaCheckpointMark.finalizeCheckpoint method is empty), whereas it is implemented in PubsubCheckpoint.finalizeCheckpoint, which acknowledges messages to PubSub. Does this mean I have no means of reliably managing Kafka offsets on
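
For reference, later Beam releases added KafkaIO.Read.commitOffsetsInFinalize(), which commits consumed offsets back to Kafka under the configured group.id when the runner finalizes checkpoints, so a restarted job using the same group can resume from them. A minimal sketch, assuming a Beam version where this method exists; the broker, topic, and group names are illustrative:

    import com.google.common.collect.ImmutableMap;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.kafka.common.serialization.StringDeserializer;

    PCollection<KV<String, String>> records = pipeline.apply(
        KafkaIO.<String, String>read()
            .withBootstrapServers("broker-1:9092")
            .withTopic("my-topic")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // The group.id is what the committed offsets are stored under in Kafka.
            .updateConsumerProperties(
                ImmutableMap.<String, Object>of("group.id", "my-consumer-group"))
            // Commit offsets back to Kafka when a checkpoint is finalized.
            .commitOffsetsInFinalize()
            .withoutMetadata());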