apache-beam

Forcing an empty pane/window in streaming in Apache Beam

好久不见. Submitted on 2019-12-01 12:11:19
I am trying to implement a pipeline that takes in a stream of data and, every minute, outputs True if there was any element in that minute's interval and False if there was none. A pane (with a forever-repeating time trigger) or a fixed window does not seem to fire if no element arrives during the interval. One workaround I am considering is to put the stream into a global window, use a ValueState to keep a queue that accumulates the data, and use a timer as a trigger to examine the queue. I wonder if there is any neater way of achieving this. Thanks. Alex Amato: I think your timers and state solution is a good way
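
A minimal sketch of the state-and-timer approach described above, written against the Java SDK (the single dummy key, the one-minute cadence, and the Boolean output are illustrative assumptions, not code from the question): the stream is keyed and placed in the global window, a ValueState records whether anything arrived, and a processing-time timer fires every minute, emitting True or False and resetting the flag.

    import org.apache.beam.sdk.coders.BooleanCoder;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    // Stateful DoFn: the input must be keyed (e.g. onto one dummy key) so that
    // state and timers are available. Emits one Boolean per key per minute.
    public class MinuteHeartbeatFn extends DoFn<KV<String, String>, Boolean> {

      @StateId("seen")
      private final StateSpec<ValueState<Boolean>> seenSpec = StateSpecs.value(BooleanCoder.of());

      @TimerId("tick")
      private final TimerSpec tickSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

      @ProcessElement
      public void processElement(
          ProcessContext c,
          @StateId("seen") ValueState<Boolean> seen,
          @TimerId("tick") Timer tick) {
        seen.write(true);
        // Arm (or re-arm) a processing-time timer one minute ahead; note the first
        // tick can only happen once at least one element has been observed.
        tick.offset(Duration.standardMinutes(1)).setRelative();
      }

      @OnTimer("tick")
      public void onTick(
          OnTimerContext c,
          @StateId("seen") ValueState<Boolean> seen,
          @TimerId("tick") Timer tick) {
        Boolean sawElement = seen.read();
        c.output(sawElement != null && sawElement);  // True if anything arrived since the last tick
        seen.write(false);                           // reset for the next interval
        tick.offset(Duration.standardMinutes(1)).setRelative();  // keep ticking even when idle
      }
    }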

Performance issues on Dataflow batch loads using Apache Beam

会有一股神秘感。 Submitted on 2019-12-01 12:04:37
Question: I was benchmarking Dataflow batch loads and found that they were far slower than the same loads run through the BigQuery command-line tool. The file size was around 20 MB with millions of records. I tried different machine types and got the best load performance on n1-highmem-4, with an approximate load time of 8 minutes to fill the target BQ table. When the same table load was run through the bq command-line utility, it took barely 2 minutes
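
The excerpt does not show the pipeline being benchmarked; for context only, a Beam batch load into BigQuery typically ends up as a BigQuery load job, roughly as in the sketch below (bucket, table, columns, and schema are placeholders rather than the asker's actual job). When the write method is FILE_LOADS, the BigQuery side is the same kind of load job the bq tool issues, so large gaps often come from worker startup and per-element processing rather than from BigQuery itself.

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.SimpleFunction;

    public class BatchLoadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("id").setType("STRING"),
            new TableFieldSchema().setName("value").setType("STRING")));

        p.apply("ReadFile", TextIO.read().from("gs://my-bucket/input.csv"))
            .apply("ToTableRow", MapElements.via(new SimpleFunction<String, TableRow>() {
              @Override
              public TableRow apply(String line) {
                String[] f = line.split(",", -1);  // hypothetical two-column CSV
                return new TableRow().set("id", f[0]).set("value", f[1]);
              }
            }))
            .apply("WriteToBQ", BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withSchema(schema)
                // FILE_LOADS (a load job) is the default write method for bounded input.
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

        p.run().waitUntilFinish();
      }
    }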

In Apache Beam, how to handle exceptions/errors at the pipeline-IO level

梦想与她 Submitted on 2019-12-01 11:49:16
Question: I am running the Spark runner as the pipeline runner in Apache Beam and hit an error, which raised this question. I know the error was caused by an incorrect column name in the SQL query, but my question is how to handle an error/exception at the IO level: org.apache.beam.sdk.util.UserCodeException: java.sql.SQLSyntaxErrorException: Unknown column 'FIRST_NAME' in 'field list' at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36) at org.apache.beam.sdk.io.jdbc.JdbcIO
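
Errors thrown inside JdbcIO's own read, as in the stack trace above, generally cannot be caught from pipeline-construction code; the usual options are to fix the query or to run the JDBC call in a hand-written DoFn. For failures in your own DoFns, a common pattern is a dead-letter side output; a sketch, where process() and the tag types are illustrative stand-ins:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;
    import org.apache.beam.sdk.values.TupleTag;
    import org.apache.beam.sdk.values.TupleTagList;

    public class DeadLetterSketch {
      // Hypothetical per-element processing step that may throw.
      static String process(String element) {
        return element.toUpperCase();
      }

      public static PCollection<String> withDeadLetter(PCollection<String> input) {
        final TupleTag<String> okTag = new TupleTag<String>() {};
        final TupleTag<String> failedTag = new TupleTag<String>() {};

        PCollectionTuple results = input.apply("ProcessWithDeadLetter",
            ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                try {
                  c.output(process(c.element()));
                } catch (Exception e) {
                  // Route the failing element plus the error message to a side
                  // output instead of failing the whole bundle.
                  c.output(failedTag, c.element() + " : " + e.getMessage());
                }
              }
            }).withOutputTags(okTag, TupleTagList.of(failedTag)));

        // results.get(failedTag) can be written to a dead-letter sink for inspection.
        return results.get(okTag);
      }
    }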

How to deserialise Kafka Avro messages using Apache Beam

我们两清 Submitted on 2019-12-01 11:43:32
Question: The main goal is to aggregate two Kafka topics: one with compacted, slow-moving data and the other with fast-moving data received every second. I have been able to consume messages in simple scenarios, such as a KV<Long, String>, using something like: PCollection<KV<Long,String>> input = p.apply(KafkaIO.<Long, String>read() .withKeyDeserializer(LongDeserializer.class) .withValueDeserializer(StringDeserializer.class) PCollection<String> output = input.apply(Values.<String>create()); But this
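
One approach that is often suggested for Confluent-encoded Avro values is to hand KafkaIO the Confluent deserializer together with an explicit Avro coder; a sketch under stated assumptions (broker, topic, and schema-registry URL are placeholders, the raw-type cast is a known workaround, and withConsumerConfigUpdates assumes a reasonably recent Beam release):

    import io.confluent.kafka.serializers.KafkaAvroDeserializer;
    import java.util.Collections;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.kafka.common.serialization.LongDeserializer;

    public class KafkaAvroReadSketch {
      @SuppressWarnings({"unchecked", "rawtypes"})
      public static PCollection<KV<Long, GenericRecord>> readFastTopic(Pipeline p, Schema schema) {
        return p.apply("ReadKafkaAvro",
            KafkaIO.<Long, GenericRecord>read()
                .withBootstrapServers("kafka:9092")           // placeholder broker
                .withTopic("fast-topic")                      // placeholder topic
                .withKeyDeserializer(LongDeserializer.class)
                // KafkaAvroDeserializer is declared as Deserializer<Object>, hence the raw
                // cast; the AvroCoder tells Beam how to encode GenericRecord downstream.
                .withValueDeserializerAndCoder(
                    (Class) KafkaAvroDeserializer.class, AvroCoder.of(schema))
                .withConsumerConfigUpdates(Collections.<String, Object>singletonMap(
                    "schema.registry.url", "http://registry:8081"))  // placeholder registry
                .withoutMetadata());
      }
    }

Newer Beam releases also ship a Confluent schema-registry aware deserializer provider for KafkaIO, which avoids the raw cast.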

Apache Beam Pipeline Query Table After Writing Table

∥☆過路亽.° Submitted on 2019-12-01 11:29:23
I have an Apache Beam/Dataflow pipeline that writes results to a BigQuery table. I would then like to query this table in a separate portion of the pipeline. However, I can't seem to figure out how to properly set up this pipeline dependency. The new table that I write (and then want to query) is left-joined with a separate table for some filtering logic, which is why I actually need to write the table first and then run the query. The logic would be as follows: with beam.Pipeline(options=pipeline_options) as p: table_data = p | 'CreatTable' >> # ... logic to generate table ... # Write Table

GroupIntoBatches for non-KV elements

隐身守侯 Submitted on 2019-12-01 10:36:38
Question: According to the Apache Beam 2.0.0 SDK documentation, GroupIntoBatches works only with KV collections. My dataset contains only values and there is no need to introduce keys. However, to make use of GroupIntoBatches I had to implement "fake" keys with an empty string as the key:

    static class FakeKVFn extends DoFn<String, KV<String, String>> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(KV.of("", c.element()));
      }
    }

So the overall pipeline looks like the following:
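
The excerpt cuts off before the asker's full pipeline. As a separate illustration, the same workaround can be written with the built-in WithKeys transform instead of a hand-rolled DoFn, dropping the dummy key again after batching (the batch size and element type here are illustrative):

    import org.apache.beam.sdk.transforms.GroupIntoBatches;
    import org.apache.beam.sdk.transforms.Values;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.PCollection;

    public class BatchWithoutKeysSketch {
      public static PCollection<Iterable<String>> inBatches(PCollection<String> input) {
        return input
            .apply("AttachDummyKey", WithKeys.<String, String>of(""))     // every element gets key ""
            .apply("Batch", GroupIntoBatches.<String, String>ofSize(100))
            .apply("DropDummyKey", Values.<Iterable<String>>create());
      }
    }

Note that a single constant key routes every element through one key, so the batching step itself runs without parallelism across keys.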

Error using SpannerIO in Apache Beam

我们两清 Submitted on 2019-12-01 09:16:25
This question is a follow-up to this one. I am trying to use Apache Beam to read data from a Google Spanner table (and then do some data processing). I wrote the following minimal example using the Java SDK: package com.google.cloud.dataflow.examples; import java.io.IOException; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.io.gcp.spanner.SpannerIO; import org.apache.beam.sdk.options.PipelineOptions; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.values.PCollection; import com.google.cloud
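
The snippet above is cut off mid-import. For reference, a minimal SpannerIO read in the Java SDK looks roughly like the sketch below (instance, database, and query are placeholders); it produces a PCollection of Spanner Struct rows:

    import com.google.cloud.spanner.Struct;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class SpannerReadSketch {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        PCollection<Struct> rows = p.apply("ReadFromSpanner",
            SpannerIO.read()
                .withInstanceId("my-instance")                  // placeholder instance
                .withDatabaseId("my-database")                  // placeholder database
                .withQuery("SELECT id, name FROM my_table"));   // placeholder query

        PipelineResult result = p.run();
        result.waitUntilFinish();
      }
    }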

How can I convert a PCollection to a list in Python Dataflow

瘦欲@ Submitted on 2019-12-01 09:12:40
Question: I have a PCollection P1 that contains a field of IDs. I want to take the complete IDs column from the PCollection as a list and pass this value to a BigQuery query for filtering one BigQuery table. What would be the fastest and most optimized way of doing this? I'm new to Dataflow and big data. Can anyone give some hints on this? Thanks! Answer 1: From what I understood of your question, you want to build the SQL statement given the IDs you have in P1. This is one example of how you can
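
The question and the truncated answer concern the Python SDK; purely as an illustration of the pattern the answer is heading toward (collect the IDs into a list-valued side input and build the filtering query from it), here is the analogous shape in the Java SDK, which this digest also uses. Table and column names, and the single "trigger" element, are hypothetical.

    import java.util.List;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    public class IdsToQuerySketch {
      public static PCollection<String> buildQuery(PCollection<String> ids) {
        // Materialize the IDs as a List-valued side input.
        final PCollectionView<List<String>> idsView = ids.apply(View.asList());

        // Build one SQL string from the full list; a single dummy element drives the DoFn.
        return ids.getPipeline()
            .apply("Trigger", Create.of("trigger"))
            .apply("BuildQuery", ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                List<String> allIds = c.sideInput(idsView);
                String inList = "'" + String.join("','", allIds) + "'";
                c.output("SELECT * FROM my_dataset.my_table WHERE id IN (" + inList + ")");
              }
            }).withSideInputs(idsView));
      }
    }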

How to specify insertId when streaming inserts to BigQuery using Apache Beam

天涯浪子 Submitted on 2019-12-01 08:56:46
BigQuery supports de-duplication for streaming inserts. How can I use this feature with Apache Beam? https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis. You might have to retry an insert because there's no way to determine the state of a streaming insert
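
At the time this was asked, BigQueryIO in the Beam SDK did not expose a way to supply your own insertId: the streaming-insert path attaches a generated insertId to each row and reuses it on its internal retries, so best-effort de-duplication is applied without user code. For reference, a plain streaming-insert write looks like the sketch below (table, schema, and dispositions are placeholders):

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Arrays;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;

    public class StreamingInsertSketch {
      public static void write(PCollection<TableRow> rows) {
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("id").setType("STRING"),
            new TableFieldSchema().setName("payload").setType("STRING")));

        rows.apply("StreamToBQ", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withSchema(schema)
            // Rows sent through the streaming-insert path carry a generated insertId,
            // which is reused on retries for best-effort de-duplication.
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
      }
    }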