google-cloud-dataflow

Deploy a Dataflow with Terraform

Submitted by 孤人 on 2021-02-11 14:27:03
Question: I'm trying to deploy a Dataflow template with Terraform on Google Cloud. There are several tutorials that include some Terraform code, and they show two options: use a module (as in the first linked tutorial) or use a resource (as in the second). With both options I get the following error: Error: googleapi: got HTTP response code 502 with body: an HTML error page titled "Error 502 (Server Error)!!1".
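
For reference, a minimal sketch of the "resource" option, using the google provider's google_dataflow_job resource to launch a classic template. The job name, template path, and bucket names below are illustrative placeholders, not values from the original question:

resource "google_dataflow_job" "example" {
  name              = "example-dataflow-job"
  # A template shipped by Google; swap in your own template path.
  template_gcs_path = "gs://dataflow-templates/latest/Word_Count"
  temp_gcs_location = "gs://example-bucket/tmp"
  region            = "us-central1"

  parameters = {
    inputFile = "gs://dataflow-samples/shakespeare/kinglear.txt"
    output    = "gs://example-bucket/output"
  }
}

The module option referenced in such tutorials is typically a thin wrapper around this same resource, so both paths end up calling the same Dataflow API.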

Dataflow SQL - Unsupported type Geography

Submitted by 南笙酒味 on 2021-02-11 13:15:58
Question: I'm trying to create a Dataflow SQL job on Google BigQuery and I got this error: Unsupported type for column centroid.centroid: GEOGRAPHY. I couldn't find any evidence that Dataflow SQL actually does not support Geography data, and geography data is not mentioned in the documentation at all. Is this the case, why is that, and is there any workaround? Answer 1: No, unfortunately Dataflow SQL does not support Geography types. It supports a subset of BigQuery Standard SQL; only the data types listed

Dataflow: streaming Windmill RPC errors for a stream

Submitted by 两盒软妹~` on 2021-02-11 12:35:38
Question: My Beam Dataflow job tries to read data from GCS and write it to Pub/Sub. However, the pipeline hangs with the following error: { job: "2019-11-04_03_53_38-5223486841492484115" logger: "org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer" message: "20 streaming Windmill RPC errors for a stream, last was: org.apache.beam.vendor.grpc.v1p21p0.io.grpc.StatusRuntimeException: ABORTED: The operation was aborted. with status Status{code=ABORTED, description=The operation was aborted., cause
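
For context, a minimal sketch of a streaming GCS-to-Pub/Sub pipeline of the kind described; the bucket, project, and topic names are assumptions for illustration, not taken from the failing job:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;

public class GcsToPubSub {
  public static void main(String[] args) {
    // Run in streaming mode, so the Dataflow workers use the Windmill backend
    // that appears in the error message.
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromGcs", TextIO.read().from("gs://example-bucket/input/*"))
     .apply("WriteToPubSub",
            PubsubIO.writeStrings().to("projects/example-project/topics/example-topic"));

    p.run();
  }
}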

Streaming write to GCS using Apache Beam per element

Submitted by 我是研究僧i on 2021-02-10 16:02:21
Question: My current Beam pipeline reads files as a stream using FileIO.matchAll().continuously(), which returns a PCollection. I want to write these files back, with the same names, to another GCS bucket, i.e. each PCollection element is one file metadata/readableFile that should be written back to another bucket after some processing. Is there any sink I should use to write each PCollection element back to GCS, or are there other ways to do it? Is it possible to create a window per element and then use
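
One possible approach (a sketch, not from the original thread) is to follow FileIO.match().continuously() with FileIO.readMatches() and a DoFn that copies each matched file to the destination bucket itself via the FileSystems API, preserving the original filename. The bucket names and poll interval are assumed placeholders:

import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.util.MimeTypes;
import org.joda.time.Duration;

public class CopyMatchedFiles {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    pipeline
        // Poll the source bucket for new files every 30 seconds, forever.
        .apply(FileIO.match()
            .filepattern("gs://source-bucket/input/*")
            .continuously(Duration.standardSeconds(30), Watch.Growth.never()))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new DoFn<FileIO.ReadableFile, Void>() {
          @ProcessElement
          public void processElement(@Element FileIO.ReadableFile file) throws Exception {
            String filename = file.getMetadata().resourceId().getFilename();
            // Any per-file processing would happen on these bytes.
            byte[] contents = file.readFullyAsBytes();
            ResourceId dest =
                FileSystems.matchNewResource("gs://dest-bucket/output/" + filename, false);
            try (WritableByteChannel channel = FileSystems.create(dest, MimeTypes.BINARY)) {
              channel.write(ByteBuffer.wrap(contents));
            }
          }
        }));

    pipeline.run();
  }
}

This keeps the one-output-file-per-input-file mapping without per-element windowing, because the DoFn performs the write directly instead of going through a grouped file sink.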

Dataflow with Python flex template - launcher timeout

Submitted by £可爱£侵袭症+ on 2021-02-10 05:22:50
Question: I'm trying to run my Python Dataflow job with a Flex Template. The job works fine locally when I run it with the direct runner (without the Flex Template), but when I run it with the Flex Template, the job is stuck in "Queued" status for a while and then fails with a timeout. Here are some of the logs I found in the GCE console: INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/local/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/dataflow/template

Example to read and write parquet file using ParquetIO through Apache Beam

Submitted by 为君一笑 on 2021-02-09 17:46:08
Question: Has anybody tried reading/writing Parquet files using Apache Beam? Support was added recently in version 2.5.0, hence there isn't much documentation. I am trying to read a JSON input file and would like to write it out in Parquet format. Thanks in advance. Answer 1: You will need to use ParquetIO.Sink. It implements FileIO. Answer 2: Add the following dependency, as ParquetIO lives in a separate module: <dependency> <groupId>org.apache.beam</groupId> <artifactId>beam-sdks-java-io-parquet</artifactId> <version>2.6.0</version>
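
To make Answer 1 concrete, here is a minimal sketch that reads text lines and writes Avro GenericRecords to Parquet with ParquetIO.sink() wrapped in FileIO.write(). The schema, paths, and the trivial line-to-record mapping are placeholders for illustration; a real job would parse each JSON line into matching fields:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class JsonToParquet {
  // Placeholder schema; adjust the fields to match the actual JSON input.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"payload\",\"type\":\"string\"}]}");

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadJsonLines", TextIO.read().from("gs://example-bucket/input/*.json"))
     .apply("ToGenericRecord", MapElements.into(TypeDescriptor.of(GenericRecord.class))
         .via((String line) -> {
           GenericRecord record = new GenericData.Record(SCHEMA);
           record.put("payload", line); // real code would parse the JSON here
           return record;
         }))
     .setCoder(AvroCoder.of(GenericRecord.class, SCHEMA))
     .apply("WriteParquet", FileIO.<GenericRecord>write()
         .via(ParquetIO.sink(SCHEMA))
         .to("gs://example-bucket/output/")
         .withSuffix(".parquet"));

    p.run();
  }
}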

Exactly-once semantics in Dataflow stateful processing

Submitted by 房东的猫 on 2021-02-08 11:43:22
Question: We are trying to cover the following scenario in a streaming setting: calculate an aggregate (let's say a count) of user events since the start of the job. The number of user events is unbounded (hence using only local state is not an option). I'll discuss three options we are considering, where the first two options are prone to data loss and the final one is unclear; we'd like to get more insight into this final one. Alternative approaches are of course welcome too. Thanks! Approach 1: Session
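
For reference, a minimal sketch of the kind of per-key stateful counting the question is about (an illustration of Beam's state API, not one of the three approaches the excerpt goes on to describe); the key and value types are assumptions:

import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Keeps a running per-key count of user events in Beam state. On Dataflow
// streaming, state writes are committed together with the processing of the
// elements that produced them, which is what the exactly-once discussion hinges on.
public class CountEventsFn extends DoFn<KV<String, String>, KV<String, Long>> {

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(
      @Element KV<String, String> event,
      @StateId("count") ValueState<Long> count,
      OutputReceiver<KV<String, Long>> out) {
    Long current = count.read();
    long updated = (current == null ? 0L : current) + 1;
    count.write(updated);
    out.output(KV.of(event.getKey(), updated));
  }
}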