apache-beam

Why is my fusion breaker losing or holding back data?

Posted by Deadly on 2019-12-12 01:29:25
Question: I am working on a streaming Dataflow pipeline that consumes messages of batched items from PubSub and eventually writes them to Datastore. For better parallelism, and also for timely acknowledgement of the messages pulled from PubSub, I unpack the batches into individual items and add a fusion breaker right after that step. So the pipeline looks like this: PubSubIO -> deserialize -> unpack -> fusion break -> validation/transform -> DatastoreIO. Here is my fusion breaker, largely copied from …
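The poster's fusion breaker is cut off above. For context, a common way to break fusion in Beam is simply to insert a shuffle boundary; below is a minimal Python-SDK sketch of that pattern (the Create step and transform names are placeholders standing in for the PubSub read and deserialize steps described in the question, not the poster's code):

    import apache_beam as beam

    def unpack(batch):
        # Stand-in for the poster's unpack step: one batch in, many items out.
        for item in batch:
            yield item

    with beam.Pipeline() as p:
        _ = (
            p
            | "CreateBatches" >> beam.Create([[1, 2, 3], [4, 5]])  # placeholder for PubSub + deserialize
            | "Unpack" >> beam.FlatMap(unpack)
            # Reshuffle inserts a shuffle boundary, i.e. a fusion break: the
            # fan-out produced by Unpack can be redistributed across workers
            # before the validation/transform and Datastore write run.
            | "BreakFusion" >> beam.Reshuffle()
            | "Print" >> beam.Map(print))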

Joining rows in Apache Beam

Posted by 六眼飞鱼酱① on 2019-12-11 23:37:24
Question: I'm having trouble understanding whether the joins in Apache Beam (e.g. http://www.waitingforcode.com/apache-beam/joins-apache-beam/read) can join entire rows. For example, I have two datasets in CSV format, where the first rows are column headers. The first:

    a,b,c,d
    1,2,3,4
    5,6,7,8
    1,2,5,4

The second:

    c,d,e,f
    3,4,9,10

I want to left join on columns c and d so that I end up with:

    a,b,c,d,e,f
    1,2,3,4,9,10
    5,6,7,8,,
    1,2,5,4,,

However, all the documentation on Apache Beam seems to say the PCollection …
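The excerpt stops short of an answer, but whole rows can be joined by keying each PCollection on the join columns and applying CoGroupByKey. A minimal sketch under the assumption that rows have already been parsed into dicts (the CSV reading and parsing are omitted):

    import apache_beam as beam

    left_rows = [
        {"a": "1", "b": "2", "c": "3", "d": "4"},
        {"a": "5", "b": "6", "c": "7", "d": "8"},
        {"a": "1", "b": "2", "c": "5", "d": "4"},
    ]
    right_rows = [{"c": "3", "d": "4", "e": "9", "f": "10"}]

    def left_join(element):
        # Emit one output row per left row; pad with empty strings when the
        # right side has no match, which reproduces the trailing commas above.
        (_c, _d), grouped = element
        rights = list(grouped["right"]) or [{"e": "", "f": ""}]
        for left_row in grouped["left"]:
            for right_row in rights:
                yield {**left_row, "e": right_row["e"], "f": right_row["f"]}

    with beam.Pipeline() as p:
        left = (p | "Left" >> beam.Create(left_rows)
                  | "KeyLeft" >> beam.Map(lambda r: ((r["c"], r["d"]), r)))
        right = (p | "Right" >> beam.Create(right_rows)
                   | "KeyRight" >> beam.Map(lambda r: ((r["c"], r["d"]), r)))
        _ = ({"left": left, "right": right}
             | "JoinOnCD" >> beam.CoGroupByKey()
             | "LeftJoin" >> beam.FlatMap(left_join)
             | "Print" >> beam.Map(print))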

No repackaged dependencies when building Apache Beam Cassandra JAR

Posted by 此生再无相见时 on 2019-12-11 18:48:06
Question: I'm trying to compile and use the snapshot of the Apache Beam Cassandra JAR. It seems the build does not pack the Guava dependencies within the JAR. This causes compilation to fail when the JAR is used by other code; see the following exception:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/beam/vendor/guava/v20_0/com/google/common/base/Preconditions
        at org.apache.beam.sdk.io.cassandra.CassandraIO$Read.withHosts(CassandraIO.java:180)
        at org.apache.beam.examples…

Dataflow template job is not taking input parameters

Posted by 久未见 on 2019-12-11 17:55:14
Question: I have a Dataflow template created with the command below:

    python scrap.py --setup_file /home/deepak_verma/setup.py --temp_location gs://visualization-dev/temp --staging_location gs://visualization-dev/stage --project visualization-dev --job_name scrap-job --subnetwork regions/us-east1/subnetworks/dataflow-internal --region us-east1 --input sentiment_analysis.table_view --output gs://visualization-dev/incoming --runner DataflowRunner --template_location gs://visualization-dev/template/scrap

My …
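The excerpt cuts off before the actual symptom, but with classic Dataflow templates only options declared as ValueProviders are read when the template is launched; ordinary arguments are frozen at template build time. A minimal sketch of how such runtime parameters are usually declared in the Python SDK (the ScrapOptions class and the text I/O are illustrative only; the poster's pipeline reads from BigQuery):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class ScrapOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # ValueProvider arguments are resolved when the template is
            # launched, not when it is built.
            parser.add_value_provider_argument("--input", type=str)
            parser.add_value_provider_argument("--output", type=str)

    options = PipelineOptions()  # still carries --runner, --template_location, etc.
    custom = options.view_as(ScrapOptions)

    with beam.Pipeline(options=options) as p:
        _ = (p
             | "Read" >> beam.io.ReadFromText(custom.input)    # accepts a ValueProvider
             | "Write" >> beam.io.WriteToText(custom.output))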

How to deduplicate messages from GCP PubSub in DataFlow using Apache Beam's PubSubIO withIdAttribute

Posted by 两盒软妹~` on 2019-12-11 17:44:10
Question: I'm currently attempting to use withIdAttribute with PubSubIO to deduplicate messages that come from PubSub (since PubSub only guarantees at-least-once delivery). My messages have four fields: label1, label2, timestamp, and value. A value is unique to the two labels at a given timestamp. Therefore, before writing to PubSub I additionally set a uniqueID attribute equal to these three values joined as a string. For example, this is what I get from reading from a subscription using the gcp …
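For context, the publishing side of that scheme attaches the composite ID as a Pub/Sub message attribute, so that the reader's withIdAttribute (id_label in the Python SDK) can deduplicate on it. A minimal sketch using the google-cloud-pubsub client; the field and attribute names follow the question, everything else is an assumption:

    import json
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path("<project>", "<topic>")

    def publish(record):
        # Build the composite ID from the three fields that identify a value.
        unique_id = "{}-{}-{}".format(record["label1"], record["label2"], record["timestamp"])
        # Attributes are passed as keyword arguments; the reader deduplicates
        # on the attribute named by withIdAttribute / id_label.
        publisher.publish(topic, json.dumps(record).encode("utf-8"), uniqueID=unique_id)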

Apache Beam Python SDK with Pub/Sub source stuck at runtime

Posted by 丶灬走出姿态 on 2019-12-11 17:41:44
Question: I am writing a program in Apache Beam using the Python SDK to read the contents of a JSON file from Pub/Sub and do some processing on the received string. This is the part of the program where I pull contents from Pub/Sub and do the processing:

    with beam.Pipeline(options=PipelineOptions()) as pipeline:
        lines = pipeline | beam.io.gcp.pubsub.ReadStringsFromPubSub(subscription=known_args.subscription)
        lines_decoded = lines | beam.Map(lambda x: x.decode("base64"))
        lines_split = lines_decoded | (beam…
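The excerpt ends before the symptom is described, but one frequent reason a Pub/Sub pipeline appears stuck is that streaming mode was never enabled for the unbounded source. A minimal sketch with streaming turned on (the subscription path is a placeholder, and ReadFromPubSub, which returns bytes, has since replaced the deprecated ReadStringsFromPubSub):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True  # Pub/Sub is an unbounded source

    with beam.Pipeline(options=options) as pipeline:
        _ = (pipeline
             | beam.io.ReadFromPubSub(subscription="projects/<project>/subscriptions/<sub>")
             | beam.Map(lambda data: data.decode("utf-8"))
             | beam.Map(print))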

How can I stop the extra repetition in the return/yield, while still keeping the running totals for a given key: value pair?

Posted by 安稳与你 on 2019-12-11 17:38:32
Question: After passing the PCollection to the next transform, the return/yield of the transform is being multiplied, when I only need a single KV pair for a given street and accident count. My understanding is that generators can assist with this by holding values, but that only solves part of my problem. I've attempted to determine the size prior to sending to the next transform, but I haven't found any methods that give me the true size of the PCollection elements being passed.

    class CountAccidents(beam.DoFn): …
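The CountAccidents DoFn itself is cut off, so it is hard to say exactly where the repetition comes from, but the usual way to end up with exactly one (street, count) pair per street is to let a combiner do the aggregation instead of yielding running totals from a per-element DoFn. A minimal sketch with made-up data:

    import apache_beam as beam

    with beam.Pipeline() as p:
        _ = (p
             | beam.Create([("MAIN ST", 1), ("MAIN ST", 1), ("5TH AVE", 1)])
             # CombinePerKey emits exactly one (street, total) pair per street,
             # no matter how many input elements contributed to it.
             | "SumPerStreet" >> beam.CombinePerKey(sum)
             | beam.Map(print))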

Beam / DataFlow :: ReadFromPubSub(id_label) :: Unexpected behavior

Posted by 。_饼干妹妹 on 2019-12-11 17:18:41
Question: Can someone clarify the purpose of the id_label argument in the ReadFromPubSub transform? I'm using a BigQuery sink, and my understanding is that it acts like an insertId for the BQ Streaming API (Tabledata: insertAll): "A unique ID for each row. BigQuery uses this property to detect duplicate insertion requests on a best-effort basis. For more information, see data consistency." However, I don't see this expected behaviour. I'm publishing messages to Pub/Sub, each message with the same attribute message_id value (this …
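For reference, id_label is separate from BigQuery's insertId: it names a Pub/Sub message attribute whose value the runner uses to deduplicate redelivered messages, so publishing many messages that all carry the same attribute value makes them look like duplicates of a single message. A minimal sketch of the read side (subscription path is a placeholder; the attribute name follows the question):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        messages = (
            p
            # Messages whose "message_id" attribute repeats an already-seen
            # value are treated as duplicates and dropped by the runner.
            | beam.io.ReadFromPubSub(
                subscription="projects/<project>/subscriptions/<sub>",
                id_label="message_id",
                with_attributes=True))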

Dataflow: No Worker Activity

Posted by 大憨熊 on 2019-12-11 17:00:03
Question: I'm having a few problems running a relatively vanilla Dataflow job from an AI Platform Notebook (the job is meant to take data from BigQuery > cleanse and prep > write to a CSV in GCS):

    options = {'staging_location': '/staging/location/',
               'temp_location': '/temp/location/',
               'job_name': 'dataflow_pipeline_job',
               'project': PROJECT,
               'teardown_policy': 'TEARDOWN_ALWAYS',
               'max_num_workers': 3,
               'region': REGION,
               'subnetwork': 'regions/<REGION>/subnetworks/<SUBNETWORK>',
               'no_save_main_session': …
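The options dictionary is cut off above. For context, a dict like this is normally converted into PipelineOptions before the pipeline is constructed; a minimal sketch of that wiring with placeholder values and a toy BigQuery-to-GCS flow (this illustrates the setup only and does not by itself explain the "no worker activity" symptom):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = {
        'staging_location': 'gs://<bucket>/staging',
        'temp_location': 'gs://<bucket>/temp',
        'job_name': 'dataflow_pipeline_job',
        'project': '<project>',
        'region': '<region>',
        'max_num_workers': 3,
        'save_main_session': True,  # commonly recommended when launching from a notebook
    }
    pipeline_options = PipelineOptions(flags=[], **options)

    with beam.Pipeline(runner='DataflowRunner', options=pipeline_options) as p:
        _ = (p
             | 'Read' >> beam.io.ReadFromBigQuery(query='SELECT 1 AS x', use_standard_sql=True)
             | 'Format' >> beam.Map(lambda row: str(row['x']))
             | 'Write' >> beam.io.WriteToText('gs://<bucket>/output/part'))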

Ranking PCollection elements

Posted by 社会主义新天地 on 2019-12-11 16:47:30
Question: I am using the Google DataFlow Java SDK 2.2.0. The use case is as follows:

PCollection pEmployees: employees and the corresponding department name. May contain up to 10 million elements.

PCollection pDepartments: department name and the number of elements to be published per department. Will contain a few hundred elements.

Task: collect elements from pEmployees according to the per-department number from pDepartments, for all departments. This will be a big collection (up to a few hundred thousand elements, or a few GBs).
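The listing ends here, but since pDepartments is only a few hundred elements it can be broadcast as a side input while pEmployees is grouped by department. A minimal sketch in the Python SDK (the question uses the Java SDK 2.2.0); the data, field order, and lack of any ranking criterion are assumptions, and for very skewed departments a combiner such as Top would be preferable to materializing each group:

    import apache_beam as beam
    from apache_beam.pvalue import AsDict

    def take_quota(kv, quota_map):
        # Emit at most quota_map[department] employees for this department.
        dept, names = kv
        for name in list(names)[:quota_map.get(dept, 0)]:
            yield (dept, name)

    with beam.Pipeline() as p:
        employees = p | "Employees" >> beam.Create(
            [("alice", "sales"), ("bob", "sales"), ("carol", "hr")])
        quotas = p | "Quotas" >> beam.Create([("sales", 1), ("hr", 1)])

        _ = (employees
             | "KeyByDept" >> beam.Map(lambda e: (e[1], e[0]))
             | "GroupByDept" >> beam.GroupByKey()
             | "TakeQuota" >> beam.FlatMap(take_quota, quota_map=AsDict(quotas))
             | beam.Map(print))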