google-cloud-dataflow

Running periodic Dataflow job

Submitted by 你离开我真会死。 on 2020-02-08 02:58:04
Question: I have to join data from Google Datastore and Google Bigtable to produce a report, and I need to execute that operation every minute. Is it possible to accomplish this with Google Cloud Dataflow (assuming the processing itself should not take long and/or can be split into independent parallel jobs)? Should I have an endless loop inside "main" creating and executing the same pipeline again and again? If most of the time in such a scenario is taken by bringing up the VMs, is it possible to instruct the…
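As an illustration of the "endless loop in main" idea mentioned in the question, here is a minimal sketch only; the project, bucket, and build_pipeline() helper are hypothetical placeholders and not from the original post:

    # Hypothetical sketch: re-submit the same Beam pipeline once a minute from
    # the driver process. Project, bucket, and build_pipeline() are placeholders.
    import time
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def build_pipeline(options):
        p = beam.Pipeline(options=options)
        # ... read from Datastore and Bigtable, join, and write the report ...
        return p

    while True:
        options = PipelineOptions(
            runner='DataflowRunner',
            project='my-project',                 # placeholder
            temp_location='gs://my-bucket/tmp',   # placeholder
            job_name='periodic-report-%d' % int(time.time()),
        )
        build_pipeline(options).run().wait_until_finish()
        time.sleep(60)

At a one-minute cadence, per-job worker start-up usually dominates, which is why a long-running streaming pipeline or a scheduler-triggered template is often considered instead.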

Schema update while writing to Avro files

Submitted by 空扰寡人 on 2020-02-06 08:47:09
Question: Context: we have a Dataflow job that transforms Pub/Sub messages into Avro GenericRecords and writes them to GCS as ".avro" files. The transformation between Pub/Sub messages and GenericRecords requires a schema. This schema changes weekly, with field additions only. We want to be able to update the fields without updating the Dataflow job. What we did: we took the advice from this post and created a Guava Cache that refreshes its content every minute. The refresh function pulls the schema from GCS.
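The job in the question is Java (Guava Cache), but purely to illustrate the refresh-every-minute pattern it describes, here is a minimal Python analog that re-reads a schema object from GCS at most once per minute; the bucket and object names are placeholders:

    # Illustrative analog of "cache the schema, refresh every minute".
    # The question's job uses Java + Guava; bucket/object names are placeholders.
    import json
    import time
    from google.cloud import storage

    _CACHE = {'schema': None, 'fetched_at': 0.0}
    _TTL_SECONDS = 60

    def get_schema(bucket_name='my-schema-bucket', blob_name='record.avsc'):
        now = time.time()
        if _CACHE['schema'] is None or now - _CACHE['fetched_at'] > _TTL_SECONDS:
            client = storage.Client()
            raw = client.bucket(bucket_name).blob(blob_name).download_as_bytes()
            _CACHE['schema'] = json.loads(raw)   # Avro schemas are JSON documents
            _CACHE['fetched_at'] = now
        return _CACHE['schema']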

Max and min for several fields inside a PCollection in Apache Beam with Python

Submitted by China☆狼群 on 2020-02-06 08:17:30
Question: I am using Apache Beam via the Python SDK and have the following problem: I have a PCollection with approximately 1 million entries; each entry looks like a list of 2-tuples [(key1,value1),(key2,value2),...] with length ~150. I need to find the max and min values across all entries of the PCollection for each key in order to normalize the values. Ideally, it would be good to obtain a PCollection with a list of tuples [(key,max_value,min_value),...], and then it would be easy to proceed with…
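One way to sketch this (an illustrative example, not the asker's code): flatten every entry's (key, value) pairs, compute per-key min and max with combiners, and join the two results with CoGroupByKey:

    # Minimal sketch: flatten the per-entry pair lists, then combine per key.
    import apache_beam as beam

    def per_key_min_max(entries):
        pairs = entries | 'FlattenPairs' >> beam.FlatMap(lambda entry: entry)
        mins = pairs | 'MinPerKey' >> beam.CombinePerKey(min)
        maxs = pairs | 'MaxPerKey' >> beam.CombinePerKey(max)
        return ({'min': mins, 'max': maxs}
                | 'JoinMinMax' >> beam.CoGroupByKey()
                | 'ToTuple' >> beam.Map(lambda kv: (
                    kv[0], list(kv[1]['max'])[0], list(kv[1]['min'])[0])))

The resulting (key, max_value, min_value) collection could then be passed, for example, as a side input to the normalization step.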

How to notify when a Dataflow job is complete

Submitted by 喜夏-厌秋 on 2020-02-05 14:07:37
Question: I want to know, on GAE, when a Dataflow job is completed. I tried to build both of the following pipelines:
1. | 'write to bigquery' >> beam.io.WriteToBigQuery(...) | WriteStringsToPubSub('projects/fakeprj/topics/a_topic')
2. | 'write to bigquery' >> beam.io.WriteToBigQuery(...) | 'DoPubSub' >> beam.ParDo(DoPubSub())  # publish using google.cloud.pubsub
But both of the above produce the following error: AttributeError: 'PDone' object has no attribute 'windowing'. How do I run a procedure after…
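A common workaround, sketched here under assumptions rather than taken from an accepted answer, is to publish the notification from the launching process once the pipeline has finished, instead of chaining a transform onto the PDone returned by WriteToBigQuery:

    # Sketch: notify via Pub/Sub from the driver after the pipeline completes.
    # The source, table schema, project, and topic below are placeholders.
    import apache_beam as beam
    from google.cloud import pubsub_v1

    def run(argv=None):
        with beam.Pipeline(argv=argv) as p:
            (p
             | 'read' >> beam.Create([{'col': 'value'}])   # placeholder source
             | 'write to bigquery' >> beam.io.WriteToBigQuery(
                   'fakeprj:dataset.table', schema='col:STRING'))
        # Leaving the `with` block waits for the pipeline result, so the job
        # is done by the time we publish.
        publisher = pubsub_v1.PublisherClient()
        topic = publisher.topic_path('fakeprj', 'a_topic')
        publisher.publish(topic, b'dataflow job finished').result()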

Sideload static data

Submitted by 北城余情 on 2020-02-05 02:04:53
Question: When processing my data in a ParDo I need to use a JSON schema stored on Google Cloud Storage. I think this may be what is called side loading? I read the pages they call documentation (https://beam.apache.org/releases/pydoc/2.16.0/apache_beam.pvalue.html) and they mention apache_beam.pvalue.AsSingleton and apache_beam.pvalue.AsSideInput, but there are zero results if I Google their usage and I can't find any example for Python. How can I read a file from storage from within a ParDo…
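This is typically done with a side input rather than reading the file inside the ParDo. A minimal sketch, where the GCS path and the DoFn are placeholders invented for illustration: read the JSON file once, turn it into a singleton side input with beam.pvalue.AsSingleton, and receive it as an extra argument in the DoFn:

    # Sketch: deliver a JSON schema from GCS to a ParDo as a singleton side input.
    # The path, elements, and DoFn body are placeholders.
    import json
    import apache_beam as beam

    class UseSchema(beam.DoFn):
        def process(self, element, schema):
            # `schema` arrives as the parsed JSON dict (the singleton side input)
            yield (element, schema)

    with beam.Pipeline() as p:
        schema = (p
                  | 'ReadSchema' >> beam.io.ReadFromText('gs://my-bucket/schema.json')
                  | 'CombineLines' >> beam.combiners.ToList()
                  | 'ParseJson' >> beam.Map(lambda lines: json.loads(''.join(lines))))
        (p
         | 'Data' >> beam.Create([{'a': 1}])
         | 'UseSchema' >> beam.ParDo(UseSchema(), beam.pvalue.AsSingleton(schema)))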

Google Dataflow “No filesystem found for scheme gs”

Submitted by 走远了吗. on 2020-02-04 05:54:05
Question: I'm trying to execute a Google Dataflow application, but it throws this exception:
java.lang.IllegalArgumentException: No filesystem found for scheme gs
    at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:459)
    at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:529)
    at org.apache.beam.sdk.io.FileBasedSink.convertToFileResourceIfPossible(FileBasedSink.java:213)
    at org.apache.beam.sdk.io.TextIO$TypedWrite.to(TextIO.java:700)
    at org.apache.beam.sdk…

Stream BigQuery table into Google Pub/Sub

Submitted by 独自空忆成欢 on 2020-02-03 07:23:43
Question: I have a Google BigQuery table and I want to stream the entire table into a Pub/Sub topic. What would be an easy/fast way to do it? Thank you in advance. Answer 1: That really depends on the size of the table. If it's a small table (a few thousand records, a couple dozen columns) then you could set up a process to query the entire table, convert the response into a JSON array, and push it to Pub/Sub. If it's a big table (millions/billions of records, hundreds of columns) you'd have to export it to a file,…
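For the small-table path described in the answer, a rough sketch might look like the following; the project, dataset, table, and topic names are placeholders:

    # Sketch: query every row of a small table, serialize each row as JSON,
    # and publish it to a Pub/Sub topic. All names are placeholders.
    import json
    from google.cloud import bigquery, pubsub_v1

    bq = bigquery.Client()
    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path('my-project', 'my-topic')

    rows = bq.query('SELECT * FROM `my-project.my_dataset.my_table`').result()
    futures = []
    for row in rows:
        payload = json.dumps(dict(row), default=str).encode('utf-8')
        futures.append(publisher.publish(topic, payload))

    # Block until every publish has been acknowledged.
    for f in futures:
        f.result()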

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

Submitted by 北战南征 on 2020-01-31 19:43:31
Question: I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google Dataflow. I need to run multiple transformations, all of which expect items to be grouped by key. Based on the answer to this question, Dataflow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to other transformations (see sample code below). I wonder if this is supposed to work efficiently in Dataflow. If not…
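The intended structure, sketched here with placeholder transforms rather than the asker's actual feature extraction code, is to run GroupByKey once and fan the grouped PCollection out to several branches:

    # Sketch: group once, then branch the grouped PCollection into several
    # downstream transforms. The source and the two branches are placeholders.
    import apache_beam as beam

    with beam.Pipeline() as p:
        grouped = (p
                   | 'Create' >> beam.Create([('a', 1), ('a', 2), ('b', 3)])
                   | 'GroupOnce' >> beam.GroupByKey())

        sums = grouped | 'SumPerKey' >> beam.Map(
            lambda kv: (kv[0], sum(kv[1])))
        counts = grouped | 'CountPerKey' >> beam.Map(
            lambda kv: (kv[0], len(list(kv[1]))))

Each branch that consumes `grouped` reuses the single GroupByKey in the pipeline graph; whether the runner then executes the branches efficiently is exactly what the question asks about.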