google-cloud-dataflow

Best way to prevent fusion in Google Dataflow?

Submitted by 别来无恙 on 2020-01-14 12:37:50
Question: From https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion: "You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation." This is what I came up with in Python - is this reasonable, and is there a simpler way?

    def prevent_fuse(collection):
        return (
            collection
            | beam.Map(lambda x: (x, 1))
            | beam.GroupByKey()
            | beam.FlatMap(lambda x: (x[0] for v in x[1]))
        )

EDIT, in response to Ben Chambers'
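
For readers on a newer SDK: recent Apache Beam Python releases ship a built-in Reshuffle transform that performs this pair/group/ungroup internally. A minimal sketch assuming such an SDK version (this is not the answer from the original thread, and the fan-out step is illustrative):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(["gs://bucket/a", "gs://bucket/b"])
     | "Fanout" >> beam.FlatMap(lambda name: [name] * 1000)  # high fan-out step we want unfused
     | "PreventFusion" >> beam.Reshuffle()                   # built-in fusion break
     | "Downstream" >> beam.Map(len))
```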

Migration from DynamoDB to Spanner/BigTable

Submitted by 左心房为你撑大大i on 2020-01-14 10:48:26
Question: I have a use case where I need to migrate 70 TB of data from DynamoDB to BigTable and Spanner. Tables with a single index will go to BigTable; the rest will go to Spanner. I can easily handle the historical loads by exporting the data to S3 --> GCS --> Spanner/BigTable. The challenging part is handling the incremental streaming writes that continue to land in DynamoDB at the same time. There are 300 tables in DynamoDB. What is the best way to handle this? Has anyone done this before?
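
One commonly suggested shape for the streaming side, sketched here under the assumption that the DynamoDB changes are forwarded to Cloud Pub/Sub (for example via DynamoDB Streams plus a small forwarding function, which is an assumption rather than something stated in the question): a streaming Beam pipeline reads the change records and routes each table to its target store. Topic name and routing table below are hypothetical.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical topic and routing table: maps DynamoDB table name -> target store.
TOPIC = "projects/my-project/topics/dynamodb-changes"
TABLE_TARGETS = {"orders": "bigtable", "customers": "spanner"}  # illustrative only

def route(record, num_partitions):
    """Partition 0 = Bigtable (single-index tables), partition 1 = Spanner."""
    return 0 if TABLE_TARGETS.get(record["table"]) == "bigtable" else 1

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    records = (p
               | beam.io.ReadFromPubSub(topic=TOPIC)
               | beam.Map(json.loads))
    to_bigtable, to_spanner = records | beam.Partition(route, 2)
    # Each branch would then apply the appropriate sink (bigtableio.WriteToBigTable,
    # or a ParDo wrapping the Spanner client); those writes are omitted here.
```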

Triggering a Dataflow job when new files are added to Cloud Storage

Submitted by 女生的网名这么多〃 on 2020-01-14 04:53:31
Question: I'd like to trigger a Dataflow job when new files are added to a Storage bucket, in order to process them and add new data to a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud node.js library. Is there a way to do this with Cloud Functions, or is there an alternative way of achieving the desired result (inserting new data into BigQuery when files are added to a Storage bucket)? Answer 1: This is
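
As a rough illustration of the pattern (shown in Python for consistency with the rest of this page, even though the question mentions node.js): a Cloud Function triggered by the bucket can launch a pre-staged Dataflow template through the Dataflow REST API. The project ID, template path, and parameters below are hypothetical.

```python
from googleapiclient.discovery import build  # google-api-python-client

def gcs_trigger(event, context):
    """Background Cloud Function triggered by a GCS object-finalize event."""
    project = "my-project"                                # hypothetical
    template = "gs://my-bucket/templates/my-template"     # hypothetical, pre-staged template
    dataflow = build("dataflow", "v1b3")
    body = {
        # Real code should sanitize the job name more carefully.
        "jobName": "process-" + event["name"].replace("/", "-").replace(".", "-"),
        "parameters": {"inputFile": f"gs://{event['bucket']}/{event['name']}"},
    }
    request = dataflow.projects().locations().templates().launch(
        projectId=project, location="us-central1", gcsPath=template, body=body)
    response = request.execute()
    print(response)
```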

Using TextIO.Write with a complicated PCollection type in Google Cloud Dataflow

Submitted by 南楼画角 on 2020-01-14 04:04:51
Question: I have a PCollection that looks like this:

    PCollection<KV<KV<String, EventSession>, Long>> windowed_counts

My goal is to write this out as a text file. I thought to use something like:

    windowed_counts.apply( TextIO.Write.to( "output" ));

but am having a hard time getting the coders set up correctly. This is what I thought would work:

    KvCoder kvcoder = KvCoder.of(KvCoder.of(StringUtf8Coder.of(), AvroDeterministicCoder.of(EventSession.class) ), TextualLongCoder.of());
    TextIO.Write.Bound io =
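
The question targets the Java SDK, but the usual workaround in any SDK is to format each element into a plain string before handing it to the text sink, rather than trying to make the coders produce readable text. A minimal sketch of that idea in the Beam Python SDK (the tuple layout and formatting are assumptions):

```python
import apache_beam as beam

def format_count(element):
    # element is ((key, session), count); the formatting choice is illustrative only.
    (key, session), count = element
    return f"{key},{session},{count}"

with beam.Pipeline() as p:
    (p
     | beam.Create([(("alice", "session-1"), 3), (("bob", "session-2"), 5)])
     | "FormatAsText" >> beam.Map(format_count)
     | beam.io.WriteToText("output"))
```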

Tag huge list of elements with lat/long with large list of geolocation data

Submitted by 夙愿已清 on 2020-01-13 20:46:34
Question: I have a huge list of geolocation events:

    Event (1 billion)
    ------
    id
    datetime
    lat
    long

and a list of points of interest loaded from OpenStreetMap:

    POI (1 million)
    ------
    id
    tag (shop, restaurant, etc.)
    lat
    long

I would like to assign to each event the tag of the point of interest. What is the best architecture to solve this problem? We tried Google BigQuery, but we would have to do a cross join and it does not work. We are open to using any other big data system. Answer 1: Using Dataflow
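
One hedged sketch of how a Dataflow-style spatial join can avoid the full cross join: key both events and POIs by a coarse lat/long grid cell, co-group, and only compare within a cell. The cell size, field names, and nearest-POI rule below are illustrative assumptions (a real implementation would also check neighbouring cells).

```python
import apache_beam as beam

CELL = 0.01  # ~1 km grid cell; an assumption, tune to the POI density

def cell_key(lat, lon):
    return (int(lat // CELL), int(lon // CELL))

def tag_events(kv):
    _, grouped = kv
    events, pois = grouped["events"], list(grouped["pois"])
    for ev in events:
        # Pick the closest POI within the same cell (naive squared-distance comparison).
        best = min(pois,
                   key=lambda poi: (poi["lat"] - ev["lat"])**2 + (poi["long"] - ev["long"])**2,
                   default=None)
        if best is not None:
            yield {**ev, "tag": best["tag"]}

with beam.Pipeline() as p:
    events = p | "Events" >> beam.Create([{"id": 1, "lat": 48.8566, "long": 2.3522}])
    pois = p | "POIs" >> beam.Create([{"id": 9, "tag": "shop", "lat": 48.8570, "long": 2.3519}])
    keyed_events = events | "KeyEvents" >> beam.Map(lambda e: (cell_key(e["lat"], e["long"]), e))
    keyed_pois = pois | "KeyPOIs" >> beam.Map(lambda poi: (cell_key(poi["lat"], poi["long"]), poi))
    ({"events": keyed_events, "pois": keyed_pois}
     | beam.CoGroupByKey()
     | beam.FlatMap(tag_events)
     | beam.Map(print))
```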

Read files from a PCollection of GCS filenames in Pipeline?

Submitted by 孤街醉人 on 2020-01-13 09:28:25
Question: I have a streaming pipeline hooked up to Pub/Sub that publishes the filenames of GCS files. From there I want to read each file and parse out the events on each line (the events are what I ultimately want to process). Can I use TextIO? Can it be used in a streaming pipeline when the filename is only defined during execution (as opposed to using TextIO as a source, where the filename(s) are known at construction time)? If not, I'm thinking of doing something like the following: Get the topic from pub/sub ParDo
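
A minimal sketch of that ParDo idea in the Beam Python SDK (the question itself is SDK-agnostic); FileSystems.open handles gs:// paths when the GCS extras are installed, and the topic name is hypothetical:

```python
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class ReadFileLines(beam.DoFn):
    """Takes a GCS path received as an element and emits the file's lines."""
    def process(self, path):
        with FileSystems.open(path) as f:
            for line in f:
                yield line.decode("utf-8").rstrip("\n")

# Usage inside a streaming pipeline:
# lines = (p
#          | beam.io.ReadFromPubSub(topic="projects/my-project/topics/filenames")
#          | beam.Map(lambda msg: msg.decode("utf-8"))
#          | beam.ParDo(ReadFileLines()))
```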

Cloud Dataflow: reading entire text files rather than line by line

Submitted by 岁酱吖の on 2020-01-12 05:36:29
Question: I'm looking for a way to read ENTIRE files, so that every file is read in its entirety into a single String. I want to pass a pattern of JSON text files on gs://my_bucket/*/*.json and then have a ParDo process each file as a whole. What's the best approach? Answer 1: I am going to give the most generally useful answer, even though there are special cases [1] where you might do something different. I think what you want to do is to define a new subclass of FileBasedSource and use Read.from(
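
For readers on a newer SDK: later Apache Beam releases also offer a built-in route that avoids writing a custom FileBasedSource. A hedged sketch in the Beam Python SDK (this is a later API, not the one the answer above refers to):

```python
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    whole_files = (p
                   | fileio.MatchFiles("gs://my_bucket/*/*.json")
                   | fileio.ReadMatches()
                   # ReadableFile.read_utf8() returns the entire file as one string.
                   | beam.Map(lambda readable_file: (readable_file.metadata.path,
                                                     readable_file.read_utf8())))
    whole_files | beam.MapTuple(lambda path, text: print(path, len(text)))
```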

Join 2 JSON inputs linked by Primary Key

Submitted by 三世轮回 on 2020-01-11 11:28:11
Question: I am trying to merge two JSON inputs (this example reads them from files, but they will come from a Google Pub/Sub input later):

orderID.json:

    {"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1}

combined.json:

    {"barcode":"95590","name":"Ash","quantity":6,"orderID":"test1"}
    {"barcode":"95591","name":"Beat","quantity":6,"orderID":"test1"}
    {"barcode":"95592","name":"Cat","quantity":6,"orderID":"test1"}
    {"barcode":"95593","name":"Dog","quantity":6,"orderID
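
For context, the standard Beam pattern for this kind of merge is to key both inputs by the shared orderID and CoGroupByKey them. A minimal sketch (file names match the question; the merge logic itself is an illustrative assumption):

```python
import json
import apache_beam as beam

def merge(kv):
    order_id, grouped = kv
    orders, items = grouped["orders"], grouped["items"]
    for order in orders:
        for item in items:
            yield {**order, **item}  # merged record sharing the same orderID

with beam.Pipeline() as p:
    orders = (p | "ReadOrders" >> beam.io.ReadFromText("orderID.json")
                | "ParseOrders" >> beam.Map(json.loads)
                | "KeyOrders" >> beam.Map(lambda o: (o["orderID"], o)))
    items = (p | "ReadItems" >> beam.io.ReadFromText("combined.json")
               | "ParseItems" >> beam.Map(json.loads)
               | "KeyItems" >> beam.Map(lambda i: (i["orderID"], i)))
    ({"orders": orders, "items": items}
     | beam.CoGroupByKey()
     | beam.FlatMap(merge)
     | beam.Map(print))
```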

How to reshuffle a PCollection<T>?

Submitted by 南楼画角 on 2020-01-11 03:42:26
Question: I am trying to implement a Reshuffle transform to prevent excessive fusion, but I don't know how to adapt the version written for PCollection<KV<String,String>> to a plain PCollection. (How to reshuffle a PCollection<KV<String,String>> is described here.) How would I extend the official Avro I/O example code to reshuffle before adding more steps to my pipeline?

    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);
    Schema schema = new Schema.Parser().parse(new
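
The general trick for an arbitrary element type is to pair each element with a throwaway key, group by that key, and then drop it again. A hedged sketch of that pattern, written in the Beam Python SDK even though the question concerns the Java SDK:

```python
import random
import apache_beam as beam

class ReshuffleAny(beam.PTransform):
    """Breaks fusion for a PCollection of any element type by pairing with random keys."""
    def expand(self, pcoll):
        return (pcoll
                | "PairWithRandomKey" >> beam.Map(lambda x: (random.randint(0, 999), x))
                | "GroupByRandomKey" >> beam.GroupByKey()
                | "DropKey" >> beam.FlatMap(lambda kv: kv[1]))

# Usage: records | ReshuffleAny() | <expensive downstream ParDo>
```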

Dataflow GroupBy -> multiple outputs based on keys

Submitted by 本小妞迷上赌 on 2020-01-07 05:02:13
Question: Is there any simple way to redirect the output of GroupBy into multiple output files based on the group keys?

    Bin.apply(GroupByKey.<String, KV<Long,Iterable<TableRow>>>create())
       .apply(ParDo.named("Print Bins").of( ... )
       .apply(TextIO.Write.to(*Output file based on key*))

If Sink is the solution, would you please share some sample code with me? Thanks! Answer 1: Beam 2.2 will include an API to do just that - TextIO.write().to(DynamicDestinations), see source. For now, if you'd like to use this API
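
For comparison, the Python SDK's newer fileio.WriteToFiles exposes a similar dynamic-destination capability; a hedged sketch in which the output path, key field, and naming scheme are illustrative:

```python
import json
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (p
     | beam.Create([{"bin": "a", "value": 1}, {"bin": "b", "value": 2}])
     | "ToJson" >> beam.Map(json.dumps)
     | fileio.WriteToFiles(
         path="output",                                      # hypothetical output directory
         destination=lambda line: json.loads(line)["bin"],   # route each line by its key
         sink=lambda dest: fileio.TextSink(),
         file_naming=fileio.destination_prefix_naming()))
```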