apache-beam

Apache Beam: parsing Dataflow Pub/Sub into a dictionary

别来无恙 submitted on 2019-12-13 04:26:08
Question: I am running a streaming pipeline using Beam / Dataflow. I am reading my input from Pub/Sub and converting it into a dict as below: raw_loads_dict = (p | 'ReadPubsubLoads' >> ReadFromPubSub(topic=PUBSUB_TOPIC_NAME).with_output_types(bytes) | 'JSONParse' >> beam.Map(lambda x: json.loads(x)) ). Since this is done on each element of a high-throughput pipeline, I am worried that this is not the most efficient way to do it. What is the best practice in this case, considering I am then manipulating the…
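For reference, a minimal sketch of the pattern the question describes, assuming a placeholder topic path; passing json.loads directly to beam.Map avoids the extra lambda call, though the parse itself still runs once per element:

    import json
    import apache_beam as beam
    from apache_beam.io.gcp.pubsub import ReadFromPubSub
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    # Placeholder topic; substitute a real 'projects/<project>/topics/<topic>' path.
    PUBSUB_TOPIC_NAME = 'projects/my-project/topics/my-topic'

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        raw_loads_dict = (
            p
            | 'ReadPubsubLoads' >> ReadFromPubSub(topic=PUBSUB_TOPIC_NAME).with_output_types(bytes)
            | 'JSONParse' >> beam.Map(json.loads)  # each message body becomes a Python dict
        )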

Join Nested Structure Table using Dataflow Java code

戏子无情 submitted on 2019-12-13 04:16:52
Question: My objective is to join two tables, where the second table is a normal table and the first one is a nested-structure table. The join key is inside the nested structure of the first table. In this case, how do I join these two tables using Dataflow Java code? WithKeys (org.apache.beam.sdk.transforms.WithKeys) accepts a direct column name and does not allow something like firstTable.columnname. Could someone help solve this case? Answer 1: If both tables are equally large, consider using the CoGroupByKey…
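The question targets the Java SDK; as a rough illustration of the CoGroupByKey approach mentioned in the answer, here is the analogous pattern in the Beam Python SDK, assuming a made-up layout where the join key of the first table sits at record['nested']['key']:

    import apache_beam as beam

    def key_by_nested(record):
        # Pull the join key out of the nested structure (layout assumed for illustration).
        return (record['nested']['key'], record)

    def key_by_column(record):
        return (record['key'], record)

    with beam.Pipeline() as p:
        first = p | 'First' >> beam.Create([{'nested': {'key': 1, 'x': 'a'}}])
        second = p | 'Second' >> beam.Create([{'key': 1, 'y': 'b'}])

        joined = (
            {'first': first | 'KeyFirst' >> beam.Map(key_by_nested),
             'second': second | 'KeySecond' >> beam.Map(key_by_column)}
            | 'Join' >> beam.CoGroupByKey()  # emits (key, {'first': [...], 'second': [...]})
        )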

Writing to GCS with Dataflow using element count

℡╲_俬逩灬. submitted on 2019-12-13 04:00:18
Question: This is in reference to Apache Beam SDK version 2.2.0. I'm attempting to use AfterPane.elementCountAtLeast(...) but not having any success so far. What I want looks a lot like "Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn", but it needs to be adapted to 2.2.0. Ultimately I just need a simple OR where a file is written after X elements OR after Y time has passed. I intend to set the time very high so that the write is triggered by the element count in the majority of cases…
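The trigger named in the question is from the Java API; as a hedged sketch of the same "X elements OR Y time" idea in the Beam Python SDK, a composite trigger can be built from AfterCount and AfterProcessingTime (the count and duration below are arbitrary examples):

    import apache_beam as beam
    from apache_beam.transforms import window
    from apache_beam.transforms.trigger import (
        AccumulationMode, AfterAny, AfterCount, AfterProcessingTime, Repeatedly)

    def window_for_writes(pcoll):
        # Fire a pane when either 1000 elements have arrived or 10 minutes have
        # passed, whichever comes first; discard fired panes so each element is
        # written only once.
        return pcoll | 'WindowForWrites' >> beam.WindowInto(
            window.GlobalWindows(),
            trigger=Repeatedly(AfterAny(AfterCount(1000),
                                        AfterProcessingTime(10 * 60))),
            accumulation_mode=AccumulationMode.DISCARDING)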

How To Filter None Values Out Of PCollection

点点圈 submitted on 2019-12-13 03:58:41
Question: My Pub/Sub pull subscription is sending over the message and a None value for each message. I need to find a way to filter out the None values as part of my pipeline processing. Of course, some help preventing the None values from arriving from the pull subscription would be nice, but I feel like I'm missing something about the general workflow of defining and applying functions via ParDo. I've set up a function to filter out None values, which seems to work based on a print-to-console check…
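A minimal sketch of the filtering step itself: beam.Filter keeps only the elements for which the predicate returns True, so None values are dropped (the Create step stands in for the Pub/Sub input):

    import apache_beam as beam

    with beam.Pipeline() as p:
        cleaned = (
            p
            | 'Create' >> beam.Create(['a', None, 'b', None])  # stand-in for the Pub/Sub source
            | 'DropNones' >> beam.Filter(lambda element: element is not None)
            | 'Print' >> beam.Map(print)
        )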

How to parallelize HTTP requests within an Apache Beam step?

…衆ロ難τιáo~ submitted on 2019-12-13 03:57:15
Question: I have an Apache Beam pipeline running on Google Dataflow whose job is rather simple: it reads individual JSON objects from Pub/Sub, parses them, and sends them via HTTP to some API. This API requires me to send the items in batches of 75, so I built a DoFn that accumulates events in a list and publishes them via this API once I have 75. This turns out to be too slow, so I thought of executing those HTTP requests in different threads using a thread pool instead. The implementation of what I have…
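As a hedged sketch of one way to express the batching in the Beam Python SDK: BatchElements groups elements into batches of roughly 75 and a DoFn posts each batch; the endpoint URL and the use of the requests library are placeholders, not details from the question:

    import apache_beam as beam
    import requests  # assumed HTTP client

    API_URL = 'https://example.com/api/items'  # placeholder endpoint

    class PostBatch(beam.DoFn):
        def process(self, batch):
            # One HTTP call per batch of up to 75 parsed JSON objects.
            response = requests.post(API_URL, json=list(batch), timeout=30)
            response.raise_for_status()
            yield len(batch)

    def send_in_batches(parsed):
        return (parsed
                | 'Batch75' >> beam.BatchElements(min_batch_size=75, max_batch_size=75)
                | 'PostBatches' >> beam.ParDo(PostBatch()))

Dataflow also runs many DoFn instances in parallel across worker threads and machines, which is often enough parallelism without an explicit thread pool inside the DoFn.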

Do Dataflow jobs hit any BigQuery quotas and limits?

…衆ロ難τιáo~ submitted on 2019-12-13 03:54:17
Question: I have around 1500 jobs to be implemented using Dataflow. Those jobs will be scheduled on a daily basis. We may end up using a huge number of DML statements via the BigQuery client library within our jobs. Listing my concerns regarding BigQuery quotas and limits. Reference: https://cloud.google.com/bigquery/quotas. Please confirm whether we need to take the daily usage limits of BigQuery into consideration in any of the scenarios mentioned below. If we implement data inserts using BigQueryIO…
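For the BigQueryIO scenario, a hedged Python-SDK sketch of the knob that decides which quota family applies: batch load jobs count against load-job limits, while streaming inserts count against streaming-insert quotas (the table spec below is a placeholder):

    import apache_beam as beam

    def write_rows(rows):
        # FILE_LOADS writes via BigQuery load jobs; switching to
        # Method.STREAMING_INSERTS would consume streaming-insert quotas instead.
        return rows | 'WriteToBQ' >> beam.io.WriteToBigQuery(
            'my-project:my_dataset.my_table',  # placeholder table spec
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)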

Why is my PCollection (SCollection) size so large compared to the BigQuery table input size?

北战南征 submitted on 2019-12-13 03:32:23
Question: The image above shows the table schema for a BigQuery table which is the input to an Apache Beam Dataflow job that runs on Spotify's scio. If you aren't familiar with scio, it's a Scala wrapper around the Apache Beam Java SDK; in particular, an "SCollection wraps PCollection". My input table on BigQuery disk is 136 GB, but looking at the size of my SCollection in the Dataflow UI, it is 504.91 GB. I understand that BigQuery is likely much better at data compression and representation, but…

ValueProvider type parameters not getting honored at template execution time

半腔热情 submitted on 2019-12-13 03:26:14
Question: I am trying to pass the Bigtable tableId, instanceId, and projectId, which are defined as ValueProvider in the TemplateOption class, at execution time, since they are runtime values, but the new values don't get honored. The pipeline gets executed with the old values that were defined when the pipeline was constructed. What changes should I make so that it honors the values at runtime? Pipeline p = Pipeline.create(options); com.google.cloud.bigtable.config.BigtableOptions.Builder…
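The snippet in the question is Java; as an illustration of the general rule that a runtime parameter must stay wrapped as a ValueProvider and only be resolved with .get() inside executing code (never at graph-construction time), here is the equivalent idea in the Beam Python SDK with a made-up option name:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class TemplateOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Registered as a ValueProvider so the value can be supplied when
            # the template is launched, not when it is built.
            parser.add_value_provider_argument('--table_id', type=str)

    class UseTableId(beam.DoFn):
        def __init__(self, table_id):
            self._table_id = table_id  # keep the ValueProvider, not its value

        def process(self, element):
            # Resolve the runtime value only while the pipeline is executing.
            yield (self._table_id.get(), element)

    options = TemplateOptions()
    with beam.Pipeline(options=options) as p:
        _ = (p
             | 'Create' >> beam.Create(['row'])
             | 'UseTableId' >> beam.ParDo(UseTableId(options.table_id)))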

How to transform an SQL table into a list of row sequences using BigQuery and Apache Beam?

谁说我不能喝 submitted on 2019-12-13 02:55:09
Question: I have a very large table where each row represents an abstraction called a Trip. Trips consist of numeric columns such as vehicle id, trip id, start time, stop time, distance traveled, driving duration, etc. So each Trip is a 1D vector of floating-point values. I want to transform this table, or list of vectors, into a list of Trip sequences where Trips are grouped into sequences by vehicle id and ordered by start time. The sequence length needs to be limited to a specific…
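One hedged way to express the grouping in the Beam Python SDK, assuming each row arrives as a dict with 'vehicle_id' and 'start_time' fields and that MAX_LEN is the desired sequence-length limit (both names are assumptions, not from the question):

    import apache_beam as beam

    MAX_LEN = 100  # assumed limit on sequence length

    def to_sequences(keyed):
        vehicle_id, trips = keyed
        ordered = sorted(trips, key=lambda trip: trip['start_time'])
        # Split the ordered trips into chunks of at most MAX_LEN.
        for start in range(0, len(ordered), MAX_LEN):
            yield (vehicle_id, ordered[start:start + MAX_LEN])

    def build_sequences(rows):
        return (rows
                | 'KeyByVehicle' >> beam.Map(lambda trip: (trip['vehicle_id'], trip))
                | 'GroupByVehicle' >> beam.GroupByKey()
                | 'SortAndChunk' >> beam.FlatMap(to_sequences))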

How to read BigQuery from a PCollection in Dataflow

流过昼夜 submitted on 2019-12-12 19:11:49
Question: I have a PCollection of objects that I get from Pub/Sub, let's say: PCollection<Student> pStudent; and among the Student attributes there is an attribute, let's say studentID; and I want to read an attribute (class_code) from BigQuery with this student id and set the class_code that I get from BQ on the Student object in the PCollection. Does anyone know how to implement this? I know that in Beam there is BigQueryIO, but how can I do that if the query string criteria that I want to execute in BQ come from…
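The question is about the Java SDK; as a sketch of the commonly suggested approach of querying BigQuery from inside a DoFn with the client library (rather than BigQueryIO), here is the analogous idea in Python, treating each student as a dict; the project, dataset, table, and field names are assumptions:

    import apache_beam as beam
    from google.cloud import bigquery  # assumed BigQuery client library

    class LookupClassCode(beam.DoFn):
        def setup(self):
            # One client per worker instance.
            self._client = bigquery.Client()

        def process(self, student):
            query = ('SELECT class_code FROM `my-project.my_dataset.classes` '
                     'WHERE student_id = @sid')
            job = self._client.query(
                query,
                job_config=bigquery.QueryJobConfig(query_parameters=[
                    bigquery.ScalarQueryParameter('sid', 'STRING',
                                                  student['studentID'])]))
            for row in job.result():
                student['class_code'] = row.class_code
            yield student

    def enrich(p_student):
        return p_student | 'LookupClassCode' >> beam.ParDo(LookupClassCode())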