apache-beam-io

How to specify insertId when streaming inserts to BigQuery using Apache Beam

柔情痞子 submitted on 2019-12-19 09:25:05
Question: BigQuery supports de-duplication for streaming inserts. How can I use this feature from Apache Beam? https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency "To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis."
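
For context, a minimal Java sketch (not taken from the original thread) of the standard streaming-insert write; as far as I know, BigQueryIO assigns insertIds internally for streaming inserts so that its own retries are de-duplicated, and this basic API has no per-row insertId setter. The project, dataset, table, and field names below are made up for illustration.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class StreamingInsertSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Single-column schema used only for this example.
    TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("value").setType("STRING")));
    p.apply(Create.of(new TableRow().set("value", "hello")).withCoder(TableRowJsonCoder.of()))
     .apply(BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.my_table")                     // hypothetical table
         .withSchema(schema)
         .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)    // force streaming inserts
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    p.run().waitUntilFinish();
  }
}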

Is there a way to write one file for each record with Apache Beam FileIO?

点点圈 submitted on 2019-12-18 09:54:08
Question: I am learning Apache Beam and trying to implement something similar to distcp. I use FileIO.match().filepattern() to get the input files, but while writing with FileIO.write, the files sometimes get coalesced. Knowing the partition count before job execution is not possible.

PCollection<MatchResult.Metadata> pCollection = pipeline.apply(this.name(), FileIO.match().filepattern(path()))
    .apply(FileIO.readMatches())
    .apply(name(), FileIO.<FileIO.ReadableFile>write()
        .via(FileSink.create())
        .to
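
One possible approach (a sketch, not necessarily what the accepted answer proposes) is to bypass FileIO.write(), which shards output by bundle, and instead write each element to its own file from a DoFn using the FileSystems API. The output directory, the key/value element type, and the ".txt" suffix are assumptions for illustration.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Writes each element to its own file; the file name is derived from the element's key.
class WriteOneFilePerRecordFn extends DoFn<KV<String, String>, String> {
  private final String outputDir;  // e.g. "gs://my-bucket/out/" (hypothetical), must end with '/'

  WriteOneFilePerRecordFn(String outputDir) {
    this.outputDir = outputDir;
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    KV<String, String> record = c.element();
    ResourceId file = FileSystems.matchNewResource(outputDir + record.getKey() + ".txt", false /* isDirectory */);
    try (WritableByteChannel channel = FileSystems.create(file, "text/plain")) {
      channel.write(ByteBuffer.wrap(record.getValue().getBytes(StandardCharsets.UTF_8)));
    }
    c.output(file.toString());  // emit the path so downstream steps can log or verify it
  }
}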

Streaming pipelines with BigQuery sinks in Python

寵の児 submitted on 2019-12-13 12:21:04
Question: I'm building an Apache Beam streaming pipeline whose source is Pub/Sub and whose sink is BigQuery. I've gotten the error message: "Workflow failed. Causes: Unknown message code." As cryptic as this message is, I now believe BigQuery is not supported as a sink for streaming pipelines; it says so here: Streaming from Pub/Sub to BigQuery. Am I correct that this is what's causing the problem? Or, if not, is it still unsupported in any case? Can anyone hint at when this

HTTP Client in DoFn

一个人想着一个人 submitted on 2019-12-12 13:25:19
Question: I would like to make POST requests through a DoFn in an Apache Beam pipeline running on Dataflow. For that, I have created a client which instantiates a CloseableHttpClient configured with a PoolingHttpClientConnectionManager. However, I instantiate a new client for each element that I process. How can I set up a persistent client shared by all my elements? And is there another class for parallel, high-speed HTTP requests that I should use? Answer 1: You can put the client into a member variable, use the
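
A minimal sketch of the pattern the answer hints at: create the client once per DoFn instance in @Setup, reuse it for every element, and close it in @Teardown. The endpoint URL and pool sizes are made-up placeholders.

import java.io.IOException;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

// One client per DoFn instance: created in @Setup, reused across bundles, closed in @Teardown.
class PostingFn extends DoFn<String, String> {
  private transient CloseableHttpClient client;

  @Setup
  public void setup() {
    PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
    cm.setMaxTotal(50);            // tune pool sizes for your workload
    cm.setDefaultMaxPerRoute(20);
    client = HttpClients.custom().setConnectionManager(cm).build();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    HttpPost post = new HttpPost("https://example.com/api");  // hypothetical endpoint
    post.setEntity(new StringEntity(c.element()));
    String body = client.execute(post, resp -> EntityUtils.toString(resp.getEntity()));
    c.output(body);
  }

  @Teardown
  public void teardown() throws IOException {
    if (client != null) {
      client.close();
    }
  }
}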

Apache Beam Python SDK with Pub/Sub source stuck at runtime

丶灬走出姿态 submitted on 2019-12-11 17:41:44
Question: I am writing a program in Apache Beam using the Python SDK to read the contents of a JSON file from Pub/Sub and do some processing on the received string. This is the part of the program where I pull contents from Pub/Sub and do the processing:

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    lines = pipeline | beam.io.gcp.pubsub.ReadStringsFromPubSub(subscription=known_args.subscription)
    lines_decoded = lines | beam.Map(lambda x: x.decode("base64"))
    lines_split = lines_decoded | (beam

Apache Beam Java SDK SparkRunner write to parquet error

你说的曾经没有我的故事 submitted on 2019-12-11 15:45:57
Question: I'm using Apache Beam with Java. I'm trying to read a CSV file and write it out in Parquet format using the SparkRunner on a pre-deployed Spark environment, in local mode. Everything worked fine with the DirectRunner, but the SparkRunner simply won't work. I'm using the Maven Shade plugin to build a fat jar. The code is as below:

Java:
public class ImportCSVToParquet {
    // -- omitted
    File csv = new File(filePath);
    PCollection<String> vals = pipeline.apply(TextIO.read().from(filePath));
    String parquetFilename = csv
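
Setting the runner question aside (with a fat jar this kind of failure is often a packaging problem, e.g. service files not being merged by the Shade plugin), here is a self-contained sketch of the CSV-to-Parquet part with TextIO and ParquetIO. The Avro schema, column layout, and paths are assumptions for illustration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;

public class CsvToParquetSketch {
  // Two-column record, made up for illustration; adapt to the real CSV layout.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"value\",\"type\":\"string\"}]}");

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.read().from("/path/to/input.csv"))        // hypothetical input path
     .apply(MapElements.via(new SimpleFunction<String, GenericRecord>() {
          @Override
          public GenericRecord apply(String line) {
            String[] cols = line.split(",", -1);
            GenericRecord r = new GenericData.Record(SCHEMA);
            r.put("id", cols[0]);
            r.put("value", cols[1]);
            return r;
          }
        }))
     .setCoder(AvroCoder.of(SCHEMA))                          // GenericRecord needs an explicit coder
     .apply(FileIO.<GenericRecord>write()
         .via(ParquetIO.sink(SCHEMA))
         .to("/path/to/output/")                              // hypothetical output directory
         .withSuffix(".parquet"));
    p.run().waitUntilFinish();
  }
}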

Assigning to GenericRecord the timestamp from inner object

删除回忆录丶 submitted on 2019-12-11 13:36:32
Question: Processing streaming events and writing files into hourly buckets is a challenge because of windowing, as some events from the incoming hour can end up in previous ones, and so on. I've been digging around Apache Beam and its triggers, but I'm struggling to manage triggering by timestamp as follows...

Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(1)))
    .withAllowedLateness(Duration.ZERO)
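
A sketch of one way to bucket by event time rather than processing time: assign each record's timestamp from a field (a hypothetical epoch-millis field named event_timestamp), then use hourly fixed windows that fire at the watermark and again for late data. If record timestamps can be earlier than what the source assigned, timestamp skew has to be handled separately.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class HourlyWindowing {
  // Assign each record's event time from a (hypothetical) epoch-millis field.
  public static WithTimestamps<GenericRecord> assignEventTime() {
    return WithTimestamps.of(record -> new Instant((long) record.get("event_timestamp")));
  }

  // Hourly buckets: fire at the watermark, then once more per late-data batch.
  public static Window<GenericRecord> hourlyWindow() {
    return Window.<GenericRecord>into(FixedWindows.of(Duration.standardHours(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .withAllowedLateness(Duration.standardHours(1))
        .discardingFiredPanes();
  }
}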

Apache Beam with Dataflow - Nullpointer when reading from BigQuery

对着背影说爱祢 submitted on 2019-12-11 02:29:51
Question: I am running a job on Google Dataflow, written with Apache Beam, that reads from a BigQuery table and from files, transforms the data, and writes it into other BigQuery tables. The job "usually" succeeds, but sometimes I randomly get a NullPointerException when reading from the BigQuery table and my job fails: (288abb7678892196): java.lang.NullPointerException at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:98) at com.google.cloud.dataflow.worker.runners

Apache Beam - org.apache.beam.sdk.util.UserCodeException: java.sql.SQLException: Cannot create PoolableConnectionFactory (Method not supported)

时光毁灭记忆、已成空白 submitted on 2019-12-11 02:15:06
Question: I am trying to connect to a Hive instance installed on a cloud instance using Apache Beam on Dataflow. When I run the pipeline, I get the exception below. It happens when I access this database through Apache Beam. I have seen many related questions, but none of them are about Apache Beam or Google Dataflow. (c9ec8fdbe9d1719a): java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.sql.SQLException: Cannot create PoolableConnectionFactory (Method not supported) at com.google.cloud
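
For reference, a bare-bones JdbcIO read against Hive (host, credentials, query, and output path are placeholders). The "Cannot create PoolableConnectionFactory (Method not supported)" message reportedly comes from the connection pool validating connections with a JDBC call the Hive driver does not implement; if the default configuration does not work with this driver, JdbcIO.DataSourceConfiguration.create(javax.sql.DataSource) also accepts a pre-built DataSource.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class HiveJdbcReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(JdbcIO.<String>read()
         .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
             .create("org.apache.hive.jdbc.HiveDriver", "jdbc:hive2://host:10000/default")  // hypothetical URL
             .withUsername("user")
             .withPassword("password"))
         .withQuery("SELECT name FROM some_table")   // hypothetical query
         .withCoder(StringUtf8Coder.of())
         .withRowMapper(rs -> rs.getString(1)))      // first column as a String
     .apply(TextIO.write().to("/tmp/hive-rows"));    // hypothetical output prefix
    p.run().waitUntilFinish();
  }
}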

Streaming MutationGroups into Spanner

一笑奈何 submitted on 2019-12-07 10:05:49
Question: I'm trying to stream MutationGroups into Spanner with SpannerIO. The goal is to write new MutationGroups every 10 seconds, as we will use Spanner to query near-real-time KPIs. When I don't use any windows, I get the following error: Exception in thread "main" java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey. at org.apache.beam.sdk.transforms.GroupByKey
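
A sketch of the windowing the error message asks for (assuming fixed ten-second windows are acceptable for the near-real-time KPIs): window the unbounded MutationGroup stream before the grouped Spanner write so the GroupByKey inside it has a bounded scope. Instance and database IDs are placeholders.

import org.apache.beam.sdk.io.gcp.spanner.MutationGroup;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class SpannerGroupedWriteSketch {
  // Window the unbounded stream into ten-second buckets, then write each
  // window's MutationGroups to Spanner with the grouped write.
  public static void writeEveryTenSeconds(PCollection<MutationGroup> mutationGroups) {
    mutationGroups
        .apply(Window.<MutationGroup>into(FixedWindows.of(Duration.standardSeconds(10))))
        .apply(SpannerIO.write()
            .withInstanceId("my-instance")      // hypothetical instance
            .withDatabaseId("my-database")      // hypothetical database
            .grouped());
  }
}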