apache-beam

Start at a given offset with KafkaIO

Question: I'm using KafkaIO.read() and I'd like to start consuming from a specific offset. At some point there used to be a KafkaIO.read().withStartFromCheckpointMark() method for that. I see from the documentation that there is a way via the KafkaCheckpointMark provided by the runner. How can I do that? Thanks.

Answer 1: There is no direct support, but there are a couple of options: withStartReadTime() might be better suited, or you can create a group.id and commit offsets in that group. When you set group.id in…
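A minimal sketch of the withStartReadTime() option mentioned above, assuming a Beam 2.x KafkaIO; the broker address, topic name, timestamp, and the surrounding pipeline object are placeholders:

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Instant;

// 'pipeline' is an existing Pipeline; skip records published before the given
// timestamp instead of seeking to a raw offset.
pipeline.apply(KafkaIO.<String, String>read()
    .withBootstrapServers("broker:9092")
    .withTopic("my-topic")
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .withStartReadTime(Instant.parse("2019-12-01T00:00:00Z")));
```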

Problem while implementing a join of two datasets in Google Cloud Dataflow using Apache Beam

Question: I was trying to implement SQL over two datasets on Google Cloud Storage using Apache Beam, following the Apache Beam documentation at https://beam.apache.org/documentation/dsls/sql/walkthrough/, but I ended up with the exception below: An exception occurred while executing the Java class. org.apache.beam.sdk.transforms.MapElements.via(Lorg/apache/beam/sdk/transforms/SimpleFunction;)Lorg/apache/beam/sdk/transforms/MapElements; I tried changing the Beam SDK version and made other code changes, but none of them…
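The signature in the error is the MapElements.via(SimpleFunction) overload, and an error of that shape usually points to a NoSuchMethodError, i.e. beam-sdks-java-core and beam-sdks-java-extensions-sql resolving to different Beam versions; keeping all Beam artifacts on one version typically resolves it. For reference, a minimal walkthrough-style sketch once versions are aligned, assuming a Beam release with SqlTransform (older releases used BeamSql instead); the schema, values, and 'pipeline' object are placeholders:

```java
import org.apache.beam.sdk.coders.RowCoder;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

Schema schema = Schema.builder()
    .addInt32Field("id")
    .addStringField("name")
    .build();

// Build a small PCollection<Row> and query it with Beam SQL.
PCollection<Row> rows = pipeline.apply(Create.of(
        Row.withSchema(schema).addValues(1, "a").build(),
        Row.withSchema(schema).addValues(2, "b").build())
    .withCoder(RowCoder.of(schema)));

PCollection<Row> result =
    rows.apply(SqlTransform.query("SELECT id, name FROM PCOLLECTION"));
```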

Close a window based on an element value

Question: Is there a way to close a window when an input element has a flag value in the side output of a DoFn? E.g. an event that indicates the end of a session closes the window. I've been reading the docs, and the triggers are mostly time based. An example would be great. Edit: Trigger.OnElementContext.forTrigger(ExecutableTrigger trigger) seems promising, but the ExecutableTrigger docs are pretty slim at the moment.

Answer 1: I don't think this is available. There is only one data-driven trigger right now,…
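The one data-driven trigger the answer refers to is AfterPane.elementCountAtLeast. A minimal sketch of wiring it into a windowing strategy, assuming session windows; the element type MyEvent, the input collection, the gap duration, and the element count are placeholders, and note this fires panes early rather than closing a window on a flag element:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Emit a pane as soon as at least one element arrives in the session window;
// the window itself still closes on the gap duration.
events.apply(Window.<MyEvent>into(Sessions.withGapDuration(Duration.standardMinutes(10)))
    .triggering(AfterPane.elementCountAtLeast(1))
    .discardingFiredPanes()
    .withAllowedLateness(Duration.ZERO));
```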

Error while staging packages when a Dataflow job is launched from a fat jar

Question: I created a Maven project to execute a pipeline. If I run the main class, the pipeline works perfectly. If I create a fat jar and execute it, I get two different errors: one when I run it under Windows and another when I run it under Linux. Under Windows: Exception in thread "main" java.lang.RuntimeException: Error while staging packages at org.apache.beam.runners.dataflow.util.PackageUtil.stageClasspathElements(PackageUtil.java:364) at org.apache.beam.runners.dataflow.util…
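One workaround sometimes used when classpath detection misbehaves from a single bundled jar is to stage the fat jar explicitly. A rough sketch, assuming your SDK version exposes the filesToStage option on DataflowPipelineOptions; the jar path is a placeholder:

```java
import java.util.Collections;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
// Tell the Dataflow runner exactly what to stage instead of letting it scan the classpath.
options.setFilesToStage(
    Collections.singletonList("target/my-pipeline-bundled-1.0.jar"));
```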

Google Cloud Dataflow: Read from a file with a dynamic filename

Question: I am trying to build a pipeline on Google Cloud Dataflow that would do the following: (1) listen to events on a Pub/Sub subscription, (2) extract the filename from the event text, (3) read the file from a Google Cloud Storage bucket, (4) store the records in BigQuery. Following is the code: Pipeline pipeline = //create pipeline pipeline.apply("read events", PubsubIO.readStrings().fromSubscription("sub")) .apply("Deserialise events", //Code that produces ParDo.SingleOutput<String, KV<String, byte[]>>) .apply(TextIO.read…
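TextIO.read() only accepts a filepattern known at construction time; to read files whose names arrive as elements, the usual pattern is FileIO.matchAll() followed by FileIO.readMatches() and TextIO.readFiles(). A rough sketch, assuming the omitted deserialisation step emits full GCS paths as strings; the subscription name and step labels are placeholders:

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> lines = pipeline
    .apply("read events", PubsubIO.readStrings().fromSubscription("sub"))
    // ... deserialise each event and emit a full path such as gs://bucket/file.csv ...
    .apply("match files", FileIO.matchAll())
    .apply("read matches", FileIO.readMatches())
    .apply("read lines", TextIO.readFiles());
```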

BigQueryIO Unable to Write to Date-Partitioned Table

Question: I am following the instructions in the following post to write to a date-partitioned table in BigQuery. I am using a serializable function to map the window to a partition location using the $ syntax, and I get the following error: Invalid table ID "table$19700822". Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Am I missing something here? Edit, adding code: p.apply(Window.<TableRow>into(FixedWindows.of(Duration.standardDays(1)))) .apply…
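Two things stand out in the error: the 1970 date suggests the partition is being derived from elements carrying default (epoch-era) timestamps, and the table-ID validation implies the $ decorator is hitting a code path that expects a plain table name (e.g. table creation). A sketch of deriving the decorator from the element's window with a SerializableFunction, assuming the input is a daily-windowed PCollection<TableRow> named windowedRows and that the partitioned table already exists; project, dataset, and table names are placeholders:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.format.DateTimeFormat;

windowedRows.apply(BigQueryIO.writeTableRows()
    .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
      @Override
      public TableDestination apply(ValueInSingleWindow<TableRow> value) {
        // Name the partition after the start of the element's daily window.
        IntervalWindow window = (IntervalWindow) value.getWindow();
        String partition = DateTimeFormat.forPattern("yyyyMMdd").print(window.start());
        return new TableDestination("project:dataset.table$" + partition, null);
      }
    })
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
```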

How do I use MapElements and KV together in Apache Beam?

Question: I wanted to do something like: PCollection<String> a = whatever; PCollection<KV<String, User>> b = a.apply( MapElements.into(TypeDescriptor.of(KV<String, User>.class)) .via(s -> KV.of(s, new User(s)))); where User is a custom datatype with an Avro coder and a constructor that takes a string. However, I get the following error: Cannot select from parameterized type. I tried changing it to TypeDescriptor.of(KV.class) instead, but then I get: Incompatible types; Required PCollection>…
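Java does not allow .class on a parameterized type, so the usual fix is to build the descriptor with TypeDescriptors.kvs (or an anonymous TypeDescriptor subclass). A sketch, assuming User is the custom type from the question and a is the input PCollection<String>:

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

PCollection<KV<String, User>> b = a.apply(
    MapElements
        // Describe KV<String, User> without using .class on a generic type.
        .into(TypeDescriptors.kvs(
            TypeDescriptors.strings(), TypeDescriptor.of(User.class)))
        .via(s -> KV.of(s, new User(s))));
```

The anonymous-subclass form, new TypeDescriptor<KV<String, User>>() {}, works the same way if you prefer it.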

Optimizing repeated transformations in Apache Beam/DataFlow

Question: I wonder if Apache Beam / Google Dataflow is smart enough to recognize repeated transformations in the dataflow graph and run them only once. For example, if I have two branches: p | GroupByKey() | FlatMap(...) and p | combiners.Top.PerKey(...) | FlatMap(...), both will involve grouping elements by key under the hood. Will the execution engine recognize that GroupByKey() has the same input in both cases and run it only once? Or do I need to manually ensure that GroupByKey() in this case precedes all…
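As far as I know, runners do not merge transforms that are merely logically similar, so the manual route is to materialize the grouped PCollection once and derive both branches from it. A rough sketch in the question's Python style, assuming p is already a keyed PCollection; the heapq-based top-3 is only an illustration of reusing the grouped output, not combiners.Top itself:

```python
import heapq
import apache_beam as beam

grouped = p | "GroupOnce" >> beam.GroupByKey()   # single shuffle

# Branch 1: consume the grouped values directly.
branch1 = grouped | "Expand" >> beam.FlatMap(lambda kv: kv[1])

# Branch 2: derive a per-key top-3 from the same grouped output
# instead of grouping the raw input a second time.
branch2 = grouped | "Top3" >> beam.Map(lambda kv: (kv[0], heapq.nlargest(3, kv[1])))
```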

Processing with State and Timers

Question: Are there any guidelines or limitations for using stateful processing and timers with the Beam Dataflow runner (as of v2.1.0)? Things such as limitations on the size of state, the frequency of updates, etc.? The candidate streaming pipeline would use state and timers extensively for user session state, with Bigtable as durable storage.

Answer 1: Here is some general advice for your use case: aggregate multiple elements and then set a timer. Please don't create a timer per element, which would be…
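A minimal sketch of the "buffer elements, arm one timer per key" pattern with a stateful DoFn; the state and timer names, the one-minute delay, and the String element type are placeholders:

```java
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class BufferThenFlushFn extends DoFn<KV<String, String>, String> {

  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec = StateSpecs.bag();

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(ProcessContext c,
                      @StateId("buffer") BagState<String> buffer,
                      @TimerId("flush") Timer flush) {
    // Buffer the element and (re)arm a single per-key timer instead of one per element.
    buffer.add(c.element().getValue());
    flush.offset(Duration.standardMinutes(1)).setRelative();
  }

  @OnTimer("flush")
  public void onFlush(OnTimerContext c,
                      @StateId("buffer") BagState<String> buffer) {
    // Emit everything buffered for this key, then clear the state.
    for (String value : buffer.read()) {
      c.output(value);
    }
    buffer.clear();
  }
}
```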

Python + Beam + Flink

Question: I've been trying to get the Apache Beam portability framework to work with Python and Apache Flink, and I can't seem to find a complete set of instructions to get the environment working. Are there any references with a complete list of prerequisites and steps to get a simple Python pipeline working?

Answer 1: Overall, for the local portable runner (ULR), see the wiki; quoting from there: Run a Python SDK pipeline: compile the container as a local build: ./gradlew :beam-sdks-python-container:docker Start the ULR…
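Once a job server (the ULR or a Flink job server) is listening, the Python pipeline is pointed at it through pipeline options. A sketch, assuming the job server is on localhost:8099 and a Beam version that supports the LOOPBACK environment; adjust the endpoint and environment type to your setup:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",   # wherever the job server is listening
    "--environment_type=LOOPBACK",     # run the SDK harness inside this process
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["hello", "beam", "flink"])
     | beam.Map(str.upper)
     | beam.Map(print))
```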