google-cloud-dataflow

Best way to prevent fusion in Google Dataflow?

Submitted by 别来无恙 on 2020-01-14 12:37:50
Question: From https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion: "You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation." This is what I came up with in Python - is this reasonable, and is there a simpler way?

    def prevent_fuse(collection):
        return (
            collection
            | beam.Map(lambda x: (x, 1))
            | beam.GroupByKey()
            | beam.FlatMap(lambda x: (x[0] for v in x[1]))
        )

EDIT, in response to Ben Chambers'
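
For readers on a newer SDK: recent Apache Beam Python releases ship a built-in Reshuffle transform that performs this pair/group/ungroup internally. A minimal sketch assuming such an SDK version (this is not the answer from the original thread, and the fan-out step is illustrative):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(["gs://bucket/a", "gs://bucket/b"])
     | "Fanout" >> beam.FlatMap(lambda name: [name] * 1000)  # high fan-out step we want unfused
     | "PreventFusion" >> beam.Reshuffle()                   # built-in fusion break
     | "Downstream" >> beam.Map(len))
```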

Migration from DynamoDB to Spanner/BigTable

Submitted by 左心房为你撑大大i on 2020-01-14 10:48:26
Question: I have a use case where I need to migrate 70 TB of data from DynamoDB to BigTable and Spanner. Tables with a single index will go to BigTable; the rest will go to Spanner. I can easily handle the historical loads by exporting the data to S3 --> GCS --> Spanner/BigTable. The challenging part is handling the incremental streaming writes that continue to land in DynamoDB at the same time. There are 300 tables in DynamoDB. What is the best way to handle this? Has anyone done this before?
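
One commonly suggested shape for the streaming side, sketched here under the assumption that the DynamoDB changes are forwarded to Cloud Pub/Sub (for example via DynamoDB Streams plus a small forwarding function, which is an assumption rather than something stated in the question): a streaming Beam pipeline reads the change records and routes each table to its target store. Topic name and routing table below are hypothetical.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical topic and routing table: maps DynamoDB table name -> target store.
TOPIC = "projects/my-project/topics/dynamodb-changes"
TABLE_TARGETS = {"orders": "bigtable", "customers": "spanner"}  # illustrative only

def route(record, num_partitions):
    """Partition 0 = Bigtable (single-index tables), partition 1 = Spanner."""
    return 0 if TABLE_TARGETS.get(record["table"]) == "bigtable" else 1

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    records = (p
               | beam.io.ReadFromPubSub(topic=TOPIC)
               | beam.Map(json.loads))
    to_bigtable, to_spanner = records | beam.Partition(route, 2)
    # Each branch would then apply the appropriate sink (bigtableio.WriteToBigTable,
    # or a ParDo wrapping the Spanner client); those writes are omitted here.
```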

Triggering a Dataflow job when new files are added to Cloud Storage

Submitted by 女生的网名这么多〃 on 2020-01-14 04:53:31
Question: I'd like to trigger a Dataflow job when new files are added to a Storage bucket, in order to process them and add new data to a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud node.js library. Is there a way to do this with Cloud Functions, or is there an alternative way of achieving the desired result (inserting new data into BigQuery when files are added to a Storage bucket)? Answer 1: This is
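
As a rough illustration of the pattern (shown in Python for consistency with the rest of this page, even though the question mentions node.js): a Cloud Function triggered by the bucket can launch a pre-staged Dataflow template through the Dataflow REST API. The project ID, template path, and parameters below are hypothetical.

```python
from googleapiclient.discovery import build  # google-api-python-client

def gcs_trigger(event, context):
    """Background Cloud Function triggered by a GCS object-finalize event."""
    project = "my-project"                                # hypothetical
    template = "gs://my-bucket/templates/my-template"     # hypothetical, pre-staged template
    dataflow = build("dataflow", "v1b3")
    body = {
        # Real code should sanitize the job name more carefully.
        "jobName": "process-" + event["name"].replace("/", "-").replace(".", "-"),
        "parameters": {"inputFile": f"gs://{event['bucket']}/{event['name']}"},
    }
    request = dataflow.projects().locations().templates().launch(
        projectId=project, location="us-central1", gcsPath=template, body=body)
    response = request.execute()
    print(response)
```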

Using TextIO.Write with a complicated PCollection type in Google Cloud Dataflow

Submitted by 南楼画角 on 2020-01-14 04:04:51
Question: I have a PCollection that looks like this:

    PCollection<KV<KV<String, EventSession>, Long>> windowed_counts

My goal is to write this out as a text file. I thought to use something like:

    windowed_counts.apply( TextIO.Write.to( "output" ));

but am having a hard time getting the coders set up correctly. This is what I thought would work:

    KvCoder kvcoder = KvCoder.of(KvCoder.of(StringUtf8Coder.of(), AvroDeterministicCoder.of(EventSession.class) ), TextualLongCoder.of());
    TextIO.Write.Bound io =
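
The question targets the Java SDK, but the usual workaround in any SDK is to format each element into a plain string before handing it to the text sink, rather than trying to make the coders produce readable text. A minimal sketch of that idea in the Beam Python SDK (the tuple layout and formatting are assumptions):

```python
import apache_beam as beam

def format_count(element):
    # element is ((key, session), count); the formatting choice is illustrative only.
    (key, session), count = element
    return f"{key},{session},{count}"

with beam.Pipeline() as p:
    (p
     | beam.Create([(("alice", "session-1"), 3), (("bob", "session-2"), 5)])
     | "FormatAsText" >> beam.Map(format_count)
     | beam.io.WriteToText("output"))
```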

Tag huge list of elements with lat/long with large list of geolocation data

Submitted by 夙愿已清 on 2020-01-13 20:46:34
Question: I have a huge list of geolocation events:

    Event (1 billion)
    ------
    id
    datetime
    lat
    long

and a list of points of interest loaded from OpenStreetMap:

    POI (1 million)
    ------
    id
    tag (shop, restaurant, etc.)
    lat
    long

I would like to assign to each event the tag of the point of interest. What is the best architecture to solve this problem? We tried Google BigQuery, but we would have to do a cross join and it does not work. We are open to using any other big data system. Answer 1: Using Dataflow
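
One hedged sketch of how a Dataflow-style spatial join can avoid the full cross join: key both events and POIs by a coarse lat/long grid cell, co-group, and only compare within a cell. The cell size, field names, and nearest-POI rule below are illustrative assumptions (a real implementation would also check neighbouring cells).

```python
import apache_beam as beam

CELL = 0.01  # ~1 km grid cell; an assumption, tune to the POI density

def cell_key(lat, lon):
    return (int(lat // CELL), int(lon // CELL))

def tag_events(kv):
    _, grouped = kv
    events, pois = grouped["events"], list(grouped["pois"])
    for ev in events:
        # Pick the closest POI within the same cell (naive squared-distance comparison).
        best = min(pois,
                   key=lambda poi: (poi["lat"] - ev["lat"])**2 + (poi["long"] - ev["long"])**2,
                   default=None)
        if best is not None:
            yield {**ev, "tag": best["tag"]}

with beam.Pipeline() as p:
    events = p | "Events" >> beam.Create([{"id": 1, "lat": 48.8566, "long": 2.3522}])
    pois = p | "POIs" >> beam.Create([{"id": 9, "tag": "shop", "lat": 48.8570, "long": 2.3519}])
    keyed_events = events | "KeyEvents" >> beam.Map(lambda e: (cell_key(e["lat"], e["long"]), e))
    keyed_pois = pois | "KeyPOIs" >> beam.Map(lambda poi: (cell_key(poi["lat"], poi["long"]), poi))
    ({"events": keyed_events, "pois": keyed_pois}
     | beam.CoGroupByKey()
     | beam.FlatMap(tag_events)
     | beam.Map(print))
```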

Read files from a PCollection of GCS filenames in Pipeline?

Submitted by 孤街醉人 on 2020-01-13 09:28:25
Question: I have a streaming pipeline hooked up to Pub/Sub that publishes the filenames of GCS files. From there I want to read each file and parse out the events on each line (the events are what I ultimately want to process). Can I use TextIO? Can it be used in a streaming pipeline when the filename is only defined during execution (as opposed to using TextIO as a source, where the filename(s) are known at construction time)? If not, I'm thinking of doing something like the following: Get the topic from pub/sub ParDo
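
A minimal sketch of that ParDo idea in the Beam Python SDK (the question itself is SDK-agnostic); FileSystems.open handles gs:// paths when the GCS extras are installed, and the topic name is hypothetical:

```python
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class ReadFileLines(beam.DoFn):
    """Takes a GCS path received as an element and emits the file's lines."""
    def process(self, path):
        with FileSystems.open(path) as f:
            for line in f:
                yield line.decode("utf-8").rstrip("\n")

# Usage inside a streaming pipeline:
# lines = (p
#          | beam.io.ReadFromPubSub(topic="projects/my-project/topics/filenames")
#          | beam.Map(lambda msg: msg.decode("utf-8"))
#          | beam.ParDo(ReadFileLines()))
```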

Cloud Dataflow: reading entire text files rather than line by line

Submitted by 岁酱吖の on 2020-01-12 05:36:29
Question: I'm looking for a way to read ENTIRE files, so that every file is read in its entirety into a single String. I want to pass a pattern of JSON text files on gs://my_bucket/*/*.json and then have a ParDo process each file as a whole. What's the best approach? Answer 1: I am going to give the most generally useful answer, even though there are special cases [1] where you might do something different. I think what you want to do is to define a new subclass of FileBasedSource and use Read.from(
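
For readers on a newer SDK: later Apache Beam releases also offer a built-in route that avoids writing a custom FileBasedSource. A hedged sketch in the Beam Python SDK (this is a later API, not the one the answer above refers to):

```python
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    whole_files = (p
                   | fileio.MatchFiles("gs://my_bucket/*/*.json")
                   | fileio.ReadMatches()
                   # ReadableFile.read_utf8() returns the entire file as one string.
                   | beam.Map(lambda readable_file: (readable_file.metadata.path,
                                                     readable_file.read_utf8())))
    whole_files | beam.MapTuple(lambda path, text: print(path, len(text)))
```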

Join 2 JSON inputs linked by Primary Key

Submitted by 三世轮回 on 2020-01-11 11:28:11
Question: I am trying to merge two JSON inputs (this example reads them from files, but they will come from a Google Pub/Sub input later):

orderID.json:

    {"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1}

combined.json:

    {"barcode":"95590","name":"Ash","quantity":6,"orderID":"test1"}
    {"barcode":"95591","name":"Beat","quantity":6,"orderID":"test1"}
    {"barcode":"95592","name":"Cat","quantity":6,"orderID":"test1"}
    {"barcode":"95593","name":"Dog","quantity":6,"orderID
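
For context, the standard Beam pattern for this kind of merge is to key both inputs by the shared orderID and CoGroupByKey them. A minimal sketch (file names match the question; the merge logic itself is an illustrative assumption):

```python
import json
import apache_beam as beam

def merge(kv):
    order_id, grouped = kv
    orders, items = grouped["orders"], grouped["items"]
    for order in orders:
        for item in items:
            yield {**order, **item}  # merged record sharing the same orderID

with beam.Pipeline() as p:
    orders = (p | "ReadOrders" >> beam.io.ReadFromText("orderID.json")
                | "ParseOrders" >> beam.Map(json.loads)
                | "KeyOrders" >> beam.Map(lambda o: (o["orderID"], o)))
    items = (p | "ReadItems" >> beam.io.ReadFromText("combined.json")
               | "ParseItems" >> beam.Map(json.loads)
               | "KeyItems" >> beam.Map(lambda i: (i["orderID"], i)))
    ({"orders": orders, "items": items}
     | beam.CoGroupByKey()
     | beam.FlatMap(merge)
     | beam.Map(print))
```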

How to reshuffle a PCollection<T>?

Submitted by 南楼画角 on 2020-01-11 03:42:26
Question: I am trying to implement a Reshuffle transform to prevent excessive fusion, but I don't know how to adapt the version written for PCollection<KV<String,String>> to a plain PCollection. (How to reshuffle a PCollection<KV<String,String>> is described here.) How would I extend the official Avro I/O example code to reshuffle before adding more steps to my pipeline?

    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);
    Schema schema = new Schema.Parser().parse(new
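
The general trick for an arbitrary element type is to pair each element with a throwaway key, group by that key, and then drop it again. A hedged sketch of that pattern, written in the Beam Python SDK even though the question concerns the Java SDK:

```python
import random
import apache_beam as beam

class ReshuffleAny(beam.PTransform):
    """Breaks fusion for a PCollection of any element type by pairing with random keys."""
    def expand(self, pcoll):
        return (pcoll
                | "PairWithRandomKey" >> beam.Map(lambda x: (random.randint(0, 999), x))
                | "GroupByRandomKey" >> beam.GroupByKey()
                | "DropKey" >> beam.FlatMap(lambda kv: kv[1]))

# Usage: records | ReshuffleAny() | <expensive downstream ParDo>
```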

Dataflow GroupBy -> multiple outputs based on keys

Submitted by 本小妞迷上赌 on 2020-01-07 05:02:13
Question: Is there any simple way to redirect the output of GroupBy into multiple output files based on the group keys?

    Bin.apply(GroupByKey.<String, KV<Long,Iterable<TableRow>>>create())
       .apply(ParDo.named("Print Bins").of( ... )
       .apply(TextIO.Write.to(*Output file based on key*))

If Sink is the solution, would you please share some sample code with me? Thanks! Answer 1: Beam 2.2 will include an API to do just that - TextIO.write().to(DynamicDestinations), see source. For now, if you'd like to use this API
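
For comparison, the Python SDK's newer fileio.WriteToFiles exposes a similar dynamic-destination capability; a hedged sketch in which the output path, key field, and naming scheme are illustrative:

```python
import json
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (p
     | beam.Create([{"bin": "a", "value": 1}, {"bin": "b", "value": 2}])
     | "ToJson" >> beam.Map(json.dumps)
     | fileio.WriteToFiles(
         path="output",                                      # hypothetical output directory
         destination=lambda line: json.loads(line)["bin"],   # route each line by its key
         sink=lambda dest: fileio.TextSink(),
         file_naming=fileio.destination_prefix_naming()))
```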