apache-beam

Running an Apache Beam pipeline in a Spring Boot project on Google Dataflow

依然范特西╮ · Submitted on 2021-02-19 08:27:34
Question: I'm trying to run an Apache Beam pipeline in a Spring Boot project on Google Dataflow, but I keep getting this error: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions. The example I'm trying to run is the basic word count provided by the official documentation, https://beam.apache.org/get-started/wordcount-example/ . The problem is that the documentation uses a different class for each example, and each example has
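
As a rough illustration of where this error usually comes from (not a fix confirmed by this excerpt): DataflowRunner#fromOptions validates the Dataflow-specific options, so the wrapped exception typically points at a missing project, region, or gcpTempLocation, or at a staging problem. A minimal sketch of setting those options programmatically, with placeholder project, region, and bucket values:

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WordCountLauncher {
    public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation()
                .as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setProject("my-gcp-project");             // placeholder project id
        options.setRegion("us-central1");                 // placeholder region
        options.setGcpTempLocation("gs://my-bucket/tmp");  // placeholder bucket
        // DataflowRunner.fromOptions(options) is invoked when the pipeline runs;
        // missing or inaccessible values above surface as the IllegalArgumentException.
        Pipeline pipeline = Pipeline.create(options);
        // ... attach the word-count transforms from the Beam example here ...
        pipeline.run();
    }
}
```

When launching from a Spring Boot fat jar, the runner may also fail to detect the files to stage from the nested classpath, which can show up as the same wrapped exception; that is an assumption about the cause, not something stated in the excerpt.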

“java.lang.IllegalArgumentException: No filesystem found for scheme gs” when running dataflow in google cloud platform

淺唱寂寞╮ · Submitted on 2021-02-19 05:01:47
Question: I am running my Google Dataflow job on Google Cloud Platform (GCP). When I run this job locally it works fine, but when I run it on GCP I get this error: "java.lang.IllegalArgumentException: No filesystem found for scheme gs". I have access to that Google Cloud URI: I can upload my jar file to it, and I can see some temporary files from my local job. My job IDs in GCP: 2019-08-08_21_47_27-162804342585245230 (Beam version: 2.12.0), 2019-08-09_16_41_15-11728697820819900062 (Beam version: 2.14
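
For context (a sketch, not this thread's accepted answer): the exception usually means no FileSystem registrar for the gs scheme was found on the classpath, which commonly happens when an uber jar drops or overwrites the META-INF/services entries, or when the GCP filesystem module is missing. Assuming the beam-sdks-java-io-google-cloud-platform dependency is present, the registrars can be installed explicitly from the pipeline options:

```java
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class GcsSchemeCheck {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        // Scans the classpath for FileSystemRegistrar implementations and installs
        // them; if the GCS module (or its META-INF/services entry) is missing,
        // the "gs" scheme stays unregistered and gs:// paths fail as above.
        FileSystems.setDefaultPipelineOptions(options);
        System.out.println(
            FileSystems.matchNewResource("gs://my-bucket/check.txt", false)); // placeholder path
    }
}
```

If the job is packaged with the Maven Shade plugin, merging the service files with the ServicesResourceTransformer is the usual companion fix; that is an assumption, since the build setup is not shown in the excerpt.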

Google Cloud dataflow : How to initialize Hikari connection pool only once per worker (singleton)?

有些话、适合烂在心里 · Submitted on 2021-02-11 17:40:14
Question: Hibernate Utils creates the session factory along with the Hikari configuration. Currently we do this inside the @Setup method of a ParDo, but it opens far too many connections. Is there a good example of initializing the connection pool once per worker? Answer 1: If you use the @Setup method inside a DoFn to create a database connection, keep in mind that Apache Beam creates a connection pool per worker instance thread. This can result in a lot of database connections depending on the number of
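
The answer excerpt is cut off; as a hedged sketch of the pattern it seems to describe, the pool can live in a static, lazily initialized holder so every DoFn instance and thread in the same worker JVM shares one HikariDataSource instead of building its own. The class name and JDBC settings below are placeholders:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public final class PoolHolder {
    // One pool per worker JVM, shared by every DoFn instance and thread.
    private static volatile HikariDataSource dataSource;

    private PoolHolder() {}

    public static HikariDataSource getInstance() {
        if (dataSource == null) {
            synchronized (PoolHolder.class) {
                if (dataSource == null) {
                    HikariConfig config = new HikariConfig();
                    config.setJdbcUrl("jdbc:postgresql://host:5432/db"); // placeholder
                    config.setUsername("user");                          // placeholder
                    config.setPassword("password");                      // placeholder
                    config.setMaximumPoolSize(10);
                    dataSource = new HikariDataSource(config);
                }
            }
        }
        return dataSource;
    }
}
```

Inside the DoFn, @Setup would then call PoolHolder.getInstance() rather than constructing a new pool, so the per-thread setup only fetches the shared instance.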

Avro Schema for GenericRecord: Be able to leave blank fields

回眸只為那壹抹淺笑 · Submitted on 2021-02-11 17:13:34
Question: I'm using Java to convert JSON to Avro and store it in GCS using Google Dataflow. The Avro schema is created at runtime using SchemaBuilder. One of the fields I define in the schema is an optional LONG field, defined like this: SchemaBuilder.FieldAssembler<Schema> fields = SchemaBuilder.record(mainName).fields(); Schema concreteType = SchemaBuilder.nullable().longType(); fields.name("key1").type(concreteType).noDefault(); Now when I create a GenericRecord using the schema above, and
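
The question is truncated, but one hedged sketch for letting the field stay blank is to give the null/long union a null default instead of noDefault(); SchemaBuilder's optional() shortcut does exactly that. The record and field names below mirror the excerpt, the rest is assumed:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class OptionalLongFieldSketch {
    public static void main(String[] args) {
        // optional() builds the union ["null", "long"] with a null default,
        // unlike nullable()...noDefault(), which leaves the field without one.
        Schema schema = SchemaBuilder.record("MainRecord").fields()
            .name("key1").type().optional().longType()
            .endRecord();

        GenericRecord record = new GenericData.Record(schema);
        // key1 is never set; null is a legal branch of the union, so the
        // record still validates against the schema.
        System.out.println(GenericData.get().validate(schema, record)); // prints true
    }
}
```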

Better approach to call an external API in Apache Beam

这一生的挚爱 · Submitted on 2021-02-11 14:39:47
Question: I have two approaches to initializing the HttpClient in order to make an API call from a ParDo in Apache Beam. Approach 1: initialize the HttpClient object in @StartBundle and close it in @FinishBundle. The code is as follows: public class ProcessNewIncomingRequest extends DoFn<String, KV<String, String>> { @StartBundle public void startBundle() { HttpClient client = HttpClient.newHttpClient(); HttpRequest request = HttpRequest.newBuilder() .uri(URI.create(<Custom_URL>)) .build();
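
The excerpt cuts off before Approach 2, so here is a hedged alternative sketch rather than the thread's answer: since java.net.http.HttpClient is immutable and safe to share across threads, it can be created once in @Setup and reused for every bundle, with the request built per element. The URL is a placeholder:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class ProcessNewIncomingRequest extends DoFn<String, KV<String, String>> {

    private transient HttpClient client;

    @Setup
    public void setup() {
        // HttpClient is thread-safe and reusable, so one instance per DoFn
        // instance (created once, not once per bundle) is usually enough.
        client = HttpClient.newHttpClient();
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://example.com/api")) // placeholder URL
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        c.output(KV.of(c.element(), response.body()));
    }
}
```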

Dataflow: streaming Windmill RPC errors for a stream

两盒软妹~` · Submitted on 2021-02-11 12:35:38
Question: My Beam Dataflow job tries to read data from GCS and write it to Pub/Sub. However, the pipeline hangs with the following error: { job: "2019-11-04_03_53_38-5223486841492484115" logger: "org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer" message: "20 streaming Windmill RPC errors for a stream, last was: org.apache.beam.vendor.grpc.v1p21p0.io.grpc.StatusRuntimeException: ABORTED: The operation was aborted. with status Status{code=ABORTED, description=The operation was aborted., cause
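
The log excerpt is truncated at the cause. For orientation only, a minimal sketch of the pipeline shape being described (continuously watching a GCS prefix and publishing lines to Pub/Sub), with placeholder bucket and topic names and no claim about what triggers the Windmill ABORTED errors:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Watch;
import org.joda.time.Duration;

public class GcsToPubsub {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        p.apply("ReadFromGcs",
                TextIO.read()
                    .from("gs://my-bucket/input/*.txt")   // placeholder bucket
                    .watchForNewFiles(Duration.standardMinutes(1),
                        Watch.Growth.never()))            // keeps the source unbounded/streaming
         .apply("WriteToPubsub",
                PubsubIO.writeStrings()
                    .to("projects/my-project/topics/my-topic")); // placeholder topic

        p.run();
    }
}
```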

Streaming write to GCS per element using Apache Beam

我是研究僧i · Submitted on 2021-02-10 16:02:21
Question: The current Beam pipeline reads files as a stream using FileIO.matchAll().continuously(), which returns a PCollection. I want to write these files back, with the same names, to another GCS bucket, i.e. each PCollection element is one file's metadata/ReadableFile that should be written back to another bucket after some processing. Is there any sink I should use to write each PCollection item back to GCS, or are there other ways to do it? Is it possible to create a window per element and then use
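
The question is truncated; as a hedged sketch of one possible direction (not necessarily this thread's answer), FileIO.writeDynamic() can group elements by their original file name and name each output file after its destination. This assumes an upstream step has already paired each line with the file it came from, and the bucket path is a placeholder:

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class WritePerSourceFile {
    // namedLines pairs each line with the name of the file it came from
    // (produced upstream from matchAll()/readMatches()); for an unbounded
    // input it must already be windowed, and a fixed shard count is required.
    static void writeBack(PCollection<KV<String, String>> namedLines) {
        namedLines.apply(
            FileIO.<String, KV<String, String>>writeDynamic()
                .by((KV<String, String> kv) -> kv.getKey())   // destination = source file name
                .withDestinationCoder(StringUtf8Coder.of())
                .via(Contextful.fn((KV<String, String> kv) -> kv.getValue()), TextIO.sink())
                .to("gs://my-output-bucket/copies/")          // placeholder bucket
                .withNaming(name -> FileIO.Write.defaultNaming(name, ""))
                .withNumShards(1));
    }
}
```

Note that defaultNaming still appends shard numbering to the name; reproducing the exact original file names would need a custom FileNaming implementation.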

Pipeline fails when adding ReadAllFromText transform

允我心安 · Submitted on 2021-02-10 08:45:57
Question: I am trying to run a very simple program in Apache Beam to see how it works.

    import apache_beam as beam

    class Split(beam.DoFn):
        def process(self, element):
            return element

    with beam.Pipeline() as p:
        rows = (p
                | beam.io.ReadAllFromText("input.csv")
                | beam.ParDo(Split()))

While running this, I get the following errors: .... some more stack.... File "/home/raheel/code/beam-practice/lib/python2.7/site-packages/apache_beam/transforms/util.py", line 565, in expand windowing_saved = pcoll