google-cloud-dataflow

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

不羁岁月 submitted on 2020-01-31 19:38:10

Question: I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google DataFlow. I need to run multiple transformations, all of which expect items to be grouped by key. Based on the answer to this question, DataFlow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to other transformations (see sample code below). I wonder whether this is supposed to work efficiently in DataFlow. If not…
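
A minimal sketch of the pattern being asked about (the asker's own sample code is truncated in this excerpt, so the feature functions and keys below are hypothetical): apply GroupByKey once and fan the grouped PCollection out to several downstream transforms.

import apache_beam as beam

def count_values(kv):
    # Hypothetical feature: number of values per key.
    key, values = kv
    return key, len(list(values))

def sum_values(kv):
    # Hypothetical feature: sum of values per key.
    key, values = kv
    return key, sum(values)

with beam.Pipeline() as p:
    grouped = (p
               | 'Create' >> beam.Create([('u1', 1), ('u1', 2), ('u2', 3)])
               | 'GroupByKey' >> beam.GroupByKey())

    # Both branches reuse the single grouped PCollection instead of
    # each performing its own GroupByKey.
    counts = grouped | 'CountFeature' >> beam.Map(count_values)
    sums = grouped | 'SumFeature' >> beam.Map(sum_values)

    counts | 'PrintCounts' >> beam.Map(print)
    sums | 'PrintSums' >> beam.Map(print)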

Apache Beam 2.12.0 with Java 11 support?

人走茶凉 submitted on 2020-01-25 07:37:09

Question: Does Apache Beam 2.12.0 support Java 11, or should I stick with the stable Java 8 SDK for now? I see the site recommends Python 3.5 with Beam 2.12.0 per the documentation, compared to other, higher Python versions. How compatible is it with Java 11 at this time? Would the stable choice still be Java 8 with Apache Beam 2.12.0? I faced a few build issues when using Beam 2.12.0 with Java 11. Answer 1: Beam doesn't officially support Java 11; it has only experimental…

SSLHandshakeException when running Apache Beam Pipeline in Dataflow

守給你的承諾、 submitted on 2020-01-25 07:26:07

Question: I have an Apache Beam pipeline. In one of the DoFn steps it makes an HTTPS call (think REST API). All of this works fine with the DirectRunner in my local environment. This is my local environment, Apache Beam 2.16.0:

$ mvn -version
Apache Maven 3.6.1 (d66c9c0b3152b2e69ee9bac180bb8fcc8e6af555; 2019-04-04T12:00:29-07:00)
Maven home: /opt/apache-maven-3.6.1
Java version: 1.8.0_222, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en, platform encoding: UTF-8
OS name: …

Attribute error while creating custom template using python in Google Cloud DataFlow

不想你离开。 submitted on 2020-01-25 07:04:11

Question: I am facing an issue while creating a custom template for Cloud Dataflow. It's simple code that takes data from an input bucket and loads it into BigQuery. We want to load many tables, so we are trying to create a custom template. Once this works, the next step would be passing the dataset as a parameter as well.

Error message: AttributeError: 'StaticValueProvider' object has no attribute 'datasetId'

Code:

class ContactUploadOptions(PipelineOptions):
    """
    Runtime Parameters given during template execution path and organization …
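
The asker's code is cut off above; as a hedged sketch (not their actual implementation), this is how runtime parameters for a custom template are typically declared with add_value_provider_argument. The option names follow the truncated docstring; everything else is an assumption.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ContactUploadOptions(PipelineOptions):
    """Runtime parameters given during template execution: path and organization."""

    @classmethod
    def _add_argparse_args(cls, parser):
        # ValueProvider arguments are resolved at template execution time,
        # not at graph-construction time.
        parser.add_value_provider_argument('--path', type=str, help='Input path')
        parser.add_value_provider_argument('--organization', type=str, help='Organization id')

# Note: ValueProvider values are read with .get() while the pipeline runs;
# passing the provider object where a concrete value (e.g. a BigQuery table
# reference) is expected at construction time can lead to errors like the
# AttributeError quoted above.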

How to trigger a dataflow with a cloud function? (Python SDK)

北城余情 submitted on 2020-01-25 06:49:27

Question: I have a Cloud Function that is triggered by Cloud Pub/Sub. I want the same function to trigger Dataflow using the Python SDK. Here is my code:

import base64

def hello_pubsub(event, context):
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'
    print('Message of pubsub : {}'.format(message))

I deploy the function this way:

gcloud beta functions deploy hello_pubsub --runtime python37 --trigger-topic topic1

Answer 1: You have to embed your pipeline…
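
The answer is truncated here; as a hedged illustration of one common approach (not necessarily the answerer's), the sketch below launches a Dataflow job from a template previously staged in GCS, using the Dataflow REST API client. Project, region, bucket, and template names are placeholders.

from googleapiclient.discovery import build

def hello_pubsub(event, context):
    # Launch a Dataflow job from a pre-staged template when the Pub/Sub
    # message arrives. Default Cloud Function credentials are used.
    service = build('dataflow', 'v1b3', cache_discovery=False)
    request = service.projects().locations().templates().launch(
        projectId='my-project',                               # placeholder
        location='us-central1',                               # placeholder
        gcsPath='gs://my-bucket/templates/my_template',       # placeholder
        body={
            'jobName': 'job-from-pubsub',
            'parameters': {'input': 'gs://my-bucket/input.txt'},
        },
    )
    response = request.execute()
    print('Launched Dataflow job: {}'.format(response))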

Iterative processing in Dataflow

故事扮演 submitted on 2020-01-25 04:16:21

Question: As shown here, Dataflow pipelines are represented by a fixed DAG. I'm wondering if it's possible to implement a pipeline where the processing proceeds until a dynamically evaluated condition is satisfied, based on the data computed so far. Here's some pseudocode to illustrate what I'd like to implement:

PCollection pco = null
while (true):
    pco = pco.apply(someTransform())
    if (conditionSatisfied(pco)):
        break
pco.Write()

Answer 1: It seems like you really want iterative computations. Right now…
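
The answer breaks off here. As a hedged sketch of one common workaround (my example, not part of the truncated answer): because the Beam graph must be fully defined at construction time, a fixed, pre-chosen number of iterations can be unrolled into the DAG. The transform and bound below are hypothetical.

import apache_beam as beam

NUM_ITERATIONS = 3  # hypothetical fixed bound, chosen at construction time

def some_transform(x):
    # Placeholder for the per-iteration processing step.
    return x + 1

with beam.Pipeline() as p:
    pcoll = p | 'Seed' >> beam.Create([0, 1, 2])
    for i in range(NUM_ITERATIONS):
        # Each unrolled iteration becomes its own stage in the fixed DAG.
        pcoll = pcoll | 'Iteration{}'.format(i) >> beam.Map(some_transform)
    pcoll | 'Write' >> beam.Map(print)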

Failed to update work status Exception in Python Cloud Dataflow

非 Y 不嫁゛ submitted on 2020-01-24 19:18:22

Question: I have a Python Cloud Dataflow job that works fine on smaller subsets but seems to be failing for no obvious reason on the complete dataset. The only error I get in the Dataflow interface is the standard error message:

A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service.

Analysing the Stackdriver logs only shows this error:

Exception in worker loop: Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages…

Google-cloud-dataflow: Failed to insert json data to bigquery through `WriteToBigQuery/BigQuerySink` with `BigQueryDisposition.WRITE_TRUNCATE`

瘦欲@ submitted on 2020-01-24 13:04:07

Question: Given the data set below:

{"slot":"reward","result":1,"rank":1,"isLandscape":false,"p_type":"main","level":1276,"type":"ba","seqNum":42544}
{"slot":"reward_dlg","result":1,"rank":1,"isLandscape":false,"p_type":"main","level":1276,"type":"ba","seqNum":42545}
...more JSON data of this type here

I try to filter this JSON data and insert it into BigQuery with the Python SDK as follows:

ba_schema = 'slot:STRING,result:INTEGER,play_type:STRING,level:INTEGER'

class ParseJsonDoFn(beam.DoFn):
    B_TYPE = 'tag…
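
The asker's code is cut off above; a minimal sketch of the intended pattern (my reconstruction, with placeholder project, dataset, and bucket names): parse the JSON lines, map p_type to play_type per the schema, and write with WRITE_TRUNCATE.

import json
import apache_beam as beam

schema = 'slot:STRING,result:INTEGER,play_type:STRING,level:INTEGER'

def parse_line(line):
    # Keep only the fields declared in the schema; rename p_type -> play_type.
    record = json.loads(line)
    return {
        'slot': record['slot'],
        'result': record['result'],
        'play_type': record['p_type'],
        'level': record['level'],
    }

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.json')
     | 'Parse' >> beam.Map(parse_line)
     | 'Write' >> beam.io.WriteToBigQuery(
         'my-project:my_dataset.ba_table',
         schema=schema,
         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))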

Apache Beam - What are the key concepts for writing efficient data processing pipelines I should be aware of?

元气小坏坏 submitted on 2020-01-24 01:12:28

Question: I've been using Beam for some time now and I'd like to know what the key concepts are for writing efficient and optimized Beam pipelines. I have a little Spark background, and I know we may prefer reduceByKey over groupByKey to avoid shuffling and optimize network traffic. Is it the same for Beam? I'd appreciate some tips or materials/best practices. Answer 1: Some items to consider: Graph Design Considerations: Filter first; place filter operations as high in the DAG as…
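
A hedged illustration of the Beam-side analogue of the reduceByKey-vs-groupByKey point raised in the question (my example, not part of the truncated answer): CombinePerKey lets the runner combine values before the shuffle, whereas GroupByKey ships every raw value across it.

import apache_beam as beam

with beam.Pipeline() as p:
    pairs = p | 'Create' >> beam.Create([('a', 1), ('a', 2), ('b', 3)])

    # Preferred: partial sums can be computed before the shuffle
    # (combiner lifting), similar in spirit to Spark's reduceByKey.
    sums = pairs | 'SumPerKey' >> beam.CombinePerKey(sum)

    # Works, but moves all raw values across the shuffle boundary,
    # similar in spirit to Spark's groupByKey.
    grouped_sums = (pairs
                    | 'Group' >> beam.GroupByKey()
                    | 'SumGroups' >> beam.Map(lambda kv: (kv[0], sum(kv[1]))))

    sums | 'Print' >> beam.Map(print)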