google-cloud-dataflow

Running periodic Dataflow job

Submitted by 你离开我真会死。 on 2020-02-08 02:58:04
Question: I have to join data from Google Datastore and Google Bigtable to produce a report, and I need to execute that operation every minute. Is it possible to accomplish this with Google Cloud Dataflow (assuming the processing itself should not take long and/or can be split into independent parallel jobs)? Should I have an endless loop inside "main" creating and executing the same pipeline again and again? If most of the time in such a scenario is taken by bringing up the VMs, is it possible to instruct the…
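As an illustration of the "endless loop in main" idea mentioned in the question, here is a minimal sketch only; the project, bucket, and build_pipeline() helper are hypothetical placeholders and not from the original post:

    # Hypothetical sketch: re-submit the same Beam pipeline once a minute from
    # the driver process. Project, bucket, and build_pipeline() are placeholders.
    import time
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def build_pipeline(options):
        p = beam.Pipeline(options=options)
        # ... read from Datastore and Bigtable, join, and write the report ...
        return p

    while True:
        options = PipelineOptions(
            runner='DataflowRunner',
            project='my-project',                 # placeholder
            temp_location='gs://my-bucket/tmp',   # placeholder
            job_name='periodic-report-%d' % int(time.time()),
        )
        build_pipeline(options).run().wait_until_finish()
        time.sleep(60)

At a one-minute cadence, per-job worker start-up usually dominates, which is why a long-running streaming pipeline or a scheduler-triggered template is often considered instead.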

Schema update while writing to Avro files

Submitted by 空扰寡人 on 2020-02-06 08:47:09
Question: Context: we have a Dataflow job that transforms Pub/Sub messages into Avro GenericRecords and writes them to GCS as ".avro" files. The transformation between Pub/Sub messages and GenericRecords requires a schema. This schema changes weekly, with field additions only. We want to be able to update the fields without updating the Dataflow job. What we did: we took the advice from this post and created a Guava Cache that refreshes its content every minute. The refresh function pulls the schema from GCS.
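The job in the question is Java (Guava Cache), but purely to illustrate the refresh-every-minute pattern it describes, here is a minimal Python analog that re-reads a schema object from GCS at most once per minute; the bucket and object names are placeholders:

    # Illustrative analog of "cache the schema, refresh every minute".
    # The question's job uses Java + Guava; bucket/object names are placeholders.
    import json
    import time
    from google.cloud import storage

    _CACHE = {'schema': None, 'fetched_at': 0.0}
    _TTL_SECONDS = 60

    def get_schema(bucket_name='my-schema-bucket', blob_name='record.avsc'):
        now = time.time()
        if _CACHE['schema'] is None or now - _CACHE['fetched_at'] > _TTL_SECONDS:
            client = storage.Client()
            raw = client.bucket(bucket_name).blob(blob_name).download_as_bytes()
            _CACHE['schema'] = json.loads(raw)   # Avro schemas are JSON documents
            _CACHE['fetched_at'] = now
        return _CACHE['schema']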

Max and min for several fields inside a PCollection in Apache Beam with Python

Submitted by China☆狼群 on 2020-02-06 08:17:30
Question: I am using Apache Beam via the Python SDK and have the following problem: I have a PCollection with approximately 1 million entries; each entry looks like a list of 2-tuples [(key1,value1),(key2,value2),...] with length ~150. I need to find the max and min values across all entries of the PCollection for each key in order to normalize the values. Ideally, it would be good to obtain a PCollection with a list of tuples [(key,max_value,min_value),...], and then it would be easy to proceed with…
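One way to sketch this (an illustrative example, not the asker's code): flatten every entry's (key, value) pairs, compute per-key min and max with combiners, and join the two results with CoGroupByKey:

    # Minimal sketch: flatten the per-entry pair lists, then combine per key.
    import apache_beam as beam

    def per_key_min_max(entries):
        pairs = entries | 'FlattenPairs' >> beam.FlatMap(lambda entry: entry)
        mins = pairs | 'MinPerKey' >> beam.CombinePerKey(min)
        maxs = pairs | 'MaxPerKey' >> beam.CombinePerKey(max)
        return ({'min': mins, 'max': maxs}
                | 'JoinMinMax' >> beam.CoGroupByKey()
                | 'ToTuple' >> beam.Map(lambda kv: (
                    kv[0], list(kv[1]['max'])[0], list(kv[1]['min'])[0])))

The resulting (key, max_value, min_value) collection could then be passed, for example, as a side input to the normalization step.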

How to notify when a Dataflow job is complete

Submitted by 喜夏-厌秋 on 2020-02-05 14:07:37
Question: I want to know, on GAE, when a Dataflow job is completed. I tried to build both of the following pipelines:
1. | 'write to bigquery' >> beam.io.WriteToBigQuery(...) | WriteStringsToPubSub('projects/fakeprj/topics/a_topic')
2. | 'write to bigquery' >> beam.io.WriteToBigQuery(...) | 'DoPubSub' >> beam.ParDo(DoPubSub())  # publish using google.cloud.pubsub
But both of the above produce the following error: AttributeError: 'PDone' object has no attribute 'windowing'. How do I run a procedure after…
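A common workaround, sketched here under assumptions rather than taken from an accepted answer, is to publish the notification from the launching process once the pipeline has finished, instead of chaining a transform onto the PDone returned by WriteToBigQuery:

    # Sketch: notify via Pub/Sub from the driver after the pipeline completes.
    # The source, table schema, project, and topic below are placeholders.
    import apache_beam as beam
    from google.cloud import pubsub_v1

    def run(argv=None):
        with beam.Pipeline(argv=argv) as p:
            (p
             | 'read' >> beam.Create([{'col': 'value'}])   # placeholder source
             | 'write to bigquery' >> beam.io.WriteToBigQuery(
                   'fakeprj:dataset.table', schema='col:STRING'))
        # Leaving the `with` block waits for the pipeline result, so the job
        # is done by the time we publish.
        publisher = pubsub_v1.PublisherClient()
        topic = publisher.topic_path('fakeprj', 'a_topic')
        publisher.publish(topic, b'dataflow job finished').result()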

Sideload static data

Submitted by 北城余情 on 2020-02-05 02:04:53
Question: When processing my data in a ParDo I need to use a JSON schema stored on Google Cloud Storage. I think this may be what is called side loading? I read the pages they call documentation (https://beam.apache.org/releases/pydoc/2.16.0/apache_beam.pvalue.html) and they mention apache_beam.pvalue.AsSingleton and apache_beam.pvalue.AsSideInput, but there are zero results if I Google their usage and I can't find any example for Python. How can I read a file from storage from within a ParDo…
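This is typically done with a side input rather than reading the file inside the ParDo. A minimal sketch, where the GCS path and the DoFn are placeholders invented for illustration: read the JSON file once, turn it into a singleton side input with beam.pvalue.AsSingleton, and receive it as an extra argument in the DoFn:

    # Sketch: deliver a JSON schema from GCS to a ParDo as a singleton side input.
    # The path, elements, and DoFn body are placeholders.
    import json
    import apache_beam as beam

    class UseSchema(beam.DoFn):
        def process(self, element, schema):
            # `schema` arrives as the parsed JSON dict (the singleton side input)
            yield (element, schema)

    with beam.Pipeline() as p:
        schema = (p
                  | 'ReadSchema' >> beam.io.ReadFromText('gs://my-bucket/schema.json')
                  | 'CombineLines' >> beam.combiners.ToList()
                  | 'ParseJson' >> beam.Map(lambda lines: json.loads(''.join(lines))))
        (p
         | 'Data' >> beam.Create([{'a': 1}])
         | 'UseSchema' >> beam.ParDo(UseSchema(), beam.pvalue.AsSingleton(schema)))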

Google Dataflow “No filesystem found for scheme gs”

Submitted by 走远了吗. on 2020-02-04 05:54:05
Question: I'm trying to execute a Google Dataflow application, but it throws this exception:
java.lang.IllegalArgumentException: No filesystem found for scheme gs
    at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:459)
    at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:529)
    at org.apache.beam.sdk.io.FileBasedSink.convertToFileResourceIfPossible(FileBasedSink.java:213)
    at org.apache.beam.sdk.io.TextIO$TypedWrite.to(TextIO.java:700)
    at org.apache.beam.sdk…

Stream BigQuery table into Google Pub/Sub

Submitted by 独自空忆成欢 on 2020-02-03 07:23:43
Question: I have a Google BigQuery table and I want to stream the entire table into a Pub/Sub topic. What would be an easy/fast way to do it? Thank you in advance. Answer 1: That really depends on the size of the table. If it's a small table (a few thousand records, a couple dozen columns) then you could set up a process to query the entire table, convert the response into a JSON array, and push it to Pub/Sub. If it's a big table (millions/billions of records, hundreds of columns) you'd have to export it to a file,…
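For the small-table path described in the answer, a rough sketch might look like the following; the project, dataset, table, and topic names are placeholders:

    # Sketch: query every row of a small table, serialize each row as JSON,
    # and publish it to a Pub/Sub topic. All names are placeholders.
    import json
    from google.cloud import bigquery, pubsub_v1

    bq = bigquery.Client()
    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path('my-project', 'my-topic')

    rows = bq.query('SELECT * FROM `my-project.my_dataset.my_table`').result()
    futures = []
    for row in rows:
        payload = json.dumps(dict(row), default=str).encode('utf-8')
        futures.append(publisher.publish(topic, payload))

    # Block until every publish has been acknowledged.
    for f in futures:
        f.result()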

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

Submitted by 北战南征 on 2020-01-31 19:43:31
Question: I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google Dataflow. I need to run multiple transformations, all of which expect items to be grouped by key. Based on the answer to this question, Dataflow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to other transformations (see sample code below). I wonder if this is supposed to work efficiently in Dataflow. If not…
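The intended structure, sketched here with placeholder transforms rather than the asker's actual feature extraction code, is to run GroupByKey once and fan the grouped PCollection out to several branches:

    # Sketch: group once, then branch the grouped PCollection into several
    # downstream transforms. The source and the two branches are placeholders.
    import apache_beam as beam

    with beam.Pipeline() as p:
        grouped = (p
                   | 'Create' >> beam.Create([('a', 1), ('a', 2), ('b', 3)])
                   | 'GroupOnce' >> beam.GroupByKey())

        sums = grouped | 'SumPerKey' >> beam.Map(
            lambda kv: (kv[0], sum(kv[1])))
        counts = grouped | 'CountPerKey' >> beam.Map(
            lambda kv: (kv[0], len(list(kv[1]))))

Each branch that consumes `grouped` reuses the single GroupByKey in the pipeline graph; whether the runner then executes the branches efficiently is exactly what the question asks about.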