apache-beam-io

How to specify insertId when streaming inserts to BigQuery using Apache Beam

柔情痞子 submitted on 2019-12-19 09:25:05
Question: BigQuery supports de-duplication for streaming inserts. How can I use this feature from Apache Beam? https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency "To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis."
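
For context, a minimal Java sketch (not taken from the original thread) of the standard streaming-insert write; as far as I know, BigQueryIO assigns insertIds internally for streaming inserts so that its own retries are de-duplicated, and this basic API has no per-row insertId setter. The project, dataset, table, and field names below are made up for illustration.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class StreamingInsertSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Single-column schema used only for this example.
    TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("value").setType("STRING")));
    p.apply(Create.of(new TableRow().set("value", "hello")).withCoder(TableRowJsonCoder.of()))
     .apply(BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.my_table")                     // hypothetical table
         .withSchema(schema)
         .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)    // force streaming inserts
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    p.run().waitUntilFinish();
  }
}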

Is there a way to write one file for each record with Apache Beam FileIO?

点点圈 submitted on 2019-12-18 09:54:08
Question: I am learning Apache Beam and trying to implement something similar to distcp. I use FileIO.match().filepattern() to get the input files, but while writing with FileIO.write, the files sometimes get coalesced. Knowing the partition count before job execution is not possible.

PCollection<MatchResult.Metadata> pCollection = pipeline.apply(this.name(), FileIO.match().filepattern(path()))
    .apply(FileIO.readMatches())
    .apply(name(), FileIO.<FileIO.ReadableFile>write()
        .via(FileSink.create())
        .to
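
One possible approach (a sketch, not necessarily what the accepted answer proposes) is to bypass FileIO.write(), which shards output by bundle, and instead write each element to its own file from a DoFn using the FileSystems API. The output directory, the key/value element type, and the ".txt" suffix are assumptions for illustration.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Writes each element to its own file; the file name is derived from the element's key.
class WriteOneFilePerRecordFn extends DoFn<KV<String, String>, String> {
  private final String outputDir;  // e.g. "gs://my-bucket/out/" (hypothetical), must end with '/'

  WriteOneFilePerRecordFn(String outputDir) {
    this.outputDir = outputDir;
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    KV<String, String> record = c.element();
    ResourceId file = FileSystems.matchNewResource(outputDir + record.getKey() + ".txt", false /* isDirectory */);
    try (WritableByteChannel channel = FileSystems.create(file, "text/plain")) {
      channel.write(ByteBuffer.wrap(record.getValue().getBytes(StandardCharsets.UTF_8)));
    }
    c.output(file.toString());  // emit the path so downstream steps can log or verify it
  }
}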

Streaming pipelines with BigQuery sinks in Python

寵の児 submitted on 2019-12-13 12:21:04
Question: I'm building an Apache Beam streaming pipeline whose source is Pub/Sub and whose sink is BigQuery. I've gotten the error message: "Workflow failed. Causes: Unknown message code." As cryptic as this message is, I now believe BigQuery is not supported as a sink for streaming pipelines; it says so here: Streaming from Pub/Sub to BigQuery. Am I correct that this is what's causing the problem? Or, if not, is it still unsupported in any case? Can anyone hint at when this

HTTP Client in DoFn

一个人想着一个人 submitted on 2019-12-12 13:25:19
Question: I would like to make POST requests through a DoFn in an Apache Beam pipeline running on Dataflow. For that, I have created a client which instantiates a CloseableHttpClient configured with a PoolingHttpClientConnectionManager. However, I instantiate a new client for each element that I process. How can I set up a persistent client shared by all my elements? And is there another class for parallel, high-speed HTTP requests that I should use? Answer 1: You can put the client into a member variable, use the
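
A minimal sketch of the pattern the answer hints at: create the client once per DoFn instance in @Setup, reuse it for every element, and close it in @Teardown. The endpoint URL and pool sizes are made-up placeholders.

import java.io.IOException;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

// One client per DoFn instance: created in @Setup, reused across bundles, closed in @Teardown.
class PostingFn extends DoFn<String, String> {
  private transient CloseableHttpClient client;

  @Setup
  public void setup() {
    PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
    cm.setMaxTotal(50);            // tune pool sizes for your workload
    cm.setDefaultMaxPerRoute(20);
    client = HttpClients.custom().setConnectionManager(cm).build();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    HttpPost post = new HttpPost("https://example.com/api");  // hypothetical endpoint
    post.setEntity(new StringEntity(c.element()));
    String body = client.execute(post, resp -> EntityUtils.toString(resp.getEntity()));
    c.output(body);
  }

  @Teardown
  public void teardown() throws IOException {
    if (client != null) {
      client.close();
    }
  }
}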

Apache Beam Python SDK with Pub/Sub source stuck at runtime

丶灬走出姿态 submitted on 2019-12-11 17:41:44
Question: I am writing a program in Apache Beam using the Python SDK to read the contents of a JSON file from Pub/Sub and do some processing on the received string. This is the part of the program where I pull contents from Pub/Sub and do the processing:

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    lines = pipeline | beam.io.gcp.pubsub.ReadStringsFromPubSub(subscription=known_args.subscription)
    lines_decoded = lines | beam.Map(lambda x: x.decode("base64"))
    lines_split = lines_decoded | (beam

Apache Beam Java SDK SparkRunner write to parquet error

你说的曾经没有我的故事 submitted on 2019-12-11 15:45:57
Question: I'm using Apache Beam with Java. I'm trying to read a CSV file and write it out in Parquet format using the SparkRunner on a pre-deployed Spark environment, in local mode. Everything worked fine with the DirectRunner, but the SparkRunner simply won't work. I'm using the Maven Shade plugin to build a fat jar. The code is as below:

Java:
public class ImportCSVToParquet {
    // -- omitted
    File csv = new File(filePath);
    PCollection<String> vals = pipeline.apply(TextIO.read().from(filePath));
    String parquetFilename = csv
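
Setting the runner question aside (with a fat jar this kind of failure is often a packaging problem, e.g. service files not being merged by the Shade plugin), here is a self-contained sketch of the CSV-to-Parquet part with TextIO and ParquetIO. The Avro schema, column layout, and paths are assumptions for illustration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;

public class CsvToParquetSketch {
  // Two-column record, made up for illustration; adapt to the real CSV layout.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"value\",\"type\":\"string\"}]}");

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.read().from("/path/to/input.csv"))        // hypothetical input path
     .apply(MapElements.via(new SimpleFunction<String, GenericRecord>() {
          @Override
          public GenericRecord apply(String line) {
            String[] cols = line.split(",", -1);
            GenericRecord r = new GenericData.Record(SCHEMA);
            r.put("id", cols[0]);
            r.put("value", cols[1]);
            return r;
          }
        }))
     .setCoder(AvroCoder.of(SCHEMA))                          // GenericRecord needs an explicit coder
     .apply(FileIO.<GenericRecord>write()
         .via(ParquetIO.sink(SCHEMA))
         .to("/path/to/output/")                              // hypothetical output directory
         .withSuffix(".parquet"));
    p.run().waitUntilFinish();
  }
}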

Assigning to GenericRecord the timestamp from inner object

删除回忆录丶 submitted on 2019-12-11 13:36:32
Question: Processing streaming events and writing files into hourly buckets is a challenge because of windowing, as some events from the incoming hour can end up in previous ones, and so on. I've been digging around Apache Beam and its triggers, but I'm struggling to manage triggering by timestamp as follows...

Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(1)))
    .withAllowedLateness(Duration.ZERO)
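
A sketch of one way to bucket by event time rather than processing time: assign each record's timestamp from a field (a hypothetical epoch-millis field named event_timestamp), then use hourly fixed windows that fire at the watermark and again for late data. If record timestamps can be earlier than what the source assigned, timestamp skew has to be handled separately.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class HourlyWindowing {
  // Assign each record's event time from a (hypothetical) epoch-millis field.
  public static WithTimestamps<GenericRecord> assignEventTime() {
    return WithTimestamps.of(record -> new Instant((long) record.get("event_timestamp")));
  }

  // Hourly buckets: fire at the watermark, then once more per late-data batch.
  public static Window<GenericRecord> hourlyWindow() {
    return Window.<GenericRecord>into(FixedWindows.of(Duration.standardHours(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .withAllowedLateness(Duration.standardHours(1))
        .discardingFiredPanes();
  }
}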

Apache Beam with Dataflow - Nullpointer when reading from BigQuery

对着背影说爱祢 submitted on 2019-12-11 02:29:51
Question: I am running a job on Google Dataflow, written with Apache Beam, that reads from a BigQuery table and from files, transforms the data, and writes it into other BigQuery tables. The job "usually" succeeds, but sometimes I randomly get a NullPointerException when reading from the BigQuery table and my job fails: (288abb7678892196): java.lang.NullPointerException at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:98) at com.google.cloud.dataflow.worker.runners

Apache Beam - org.apache.beam.sdk.util.UserCodeException: java.sql.SQLException: Cannot create PoolableConnectionFactory (Method not supported)

时光毁灭记忆、已成空白 submitted on 2019-12-11 02:15:06
Question: I am trying to connect to a Hive instance installed on a cloud instance using Apache Beam on Dataflow. When I run the pipeline, I get the exception below. It happens when I access this database through Apache Beam. I have seen many related questions, but none of them are about Apache Beam or Google Dataflow. (c9ec8fdbe9d1719a): java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.sql.SQLException: Cannot create PoolableConnectionFactory (Method not supported) at com.google.cloud
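
For reference, a bare-bones JdbcIO read against Hive (host, credentials, query, and output path are placeholders). The "Cannot create PoolableConnectionFactory (Method not supported)" message reportedly comes from the connection pool validating connections with a JDBC call the Hive driver does not implement; if the default configuration does not work with this driver, JdbcIO.DataSourceConfiguration.create(javax.sql.DataSource) also accepts a pre-built DataSource.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class HiveJdbcReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(JdbcIO.<String>read()
         .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
             .create("org.apache.hive.jdbc.HiveDriver", "jdbc:hive2://host:10000/default")  // hypothetical URL
             .withUsername("user")
             .withPassword("password"))
         .withQuery("SELECT name FROM some_table")   // hypothetical query
         .withCoder(StringUtf8Coder.of())
         .withRowMapper(rs -> rs.getString(1)))      // first column as a String
     .apply(TextIO.write().to("/tmp/hive-rows"));    // hypothetical output prefix
    p.run().waitUntilFinish();
  }
}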

Streaming MutationGroups into Spanner

一笑奈何 submitted on 2019-12-07 10:05:49
Question: I'm trying to stream MutationGroups into Spanner with SpannerIO. The goal is to write new MutationGroups every 10 seconds, as we will use Spanner to query near-real-time KPIs. When I don't use any windows, I get the following error: Exception in thread "main" java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey. at org.apache.beam.sdk.transforms.GroupByKey
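
A sketch of the windowing the error message asks for (assuming fixed ten-second windows are acceptable for the near-real-time KPIs): window the unbounded MutationGroup stream before the grouped Spanner write so the GroupByKey inside it has a bounded scope. Instance and database IDs are placeholders.

import org.apache.beam.sdk.io.gcp.spanner.MutationGroup;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class SpannerGroupedWriteSketch {
  // Window the unbounded stream into ten-second buckets, then write each
  // window's MutationGroups to Spanner with the grouped write.
  public static void writeEveryTenSeconds(PCollection<MutationGroup> mutationGroups) {
    mutationGroups
        .apply(Window.<MutationGroup>into(FixedWindows.of(Duration.standardSeconds(10))))
        .apply(SpannerIO.write()
            .withInstanceId("my-instance")      // hypothetical instance
            .withDatabaseId("my-database")      // hypothetical database
            .grouped());
  }
}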