apache-beam

Cloud Dataflow - how does Dataflow do parallelism?

Submitted by 蹲街弑〆低调 on 2019-12-10 11:13:42
Question: My question is, behind the scenes, how does Cloud Dataflow parallelize the workload for an element-wise Beam DoFn (ParDo)? For example, in my ParDo I send out one HTTP request to an external server for each element, and I use 30 workers, each with 4 vCPUs. Does that mean there will be at most 4 threads on each worker? Does that mean only 4 HTTP connections from each worker are necessary, or can be established, if I keep them alive to get the best performance? How can I adjust the level of…
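The threading model is ultimately up to the runner, but the usual pattern for per-element calls to an external service is to create one HTTP client per DoFn instance in @Setup, so each worker thread reuses a live connection. A minimal sketch with the Java SDK, assuming a hypothetical enrichment endpoint (none of this is from the original question):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import org.apache.beam.sdk.transforms.DoFn;

    public class CallExternalServiceFn extends DoFn<String, String> {
      // transient: the client is created on the worker in setup(), never serialized with the DoFn.
      private transient HttpClient client;

      @Setup
      public void setup() {
        // One client (and its keep-alive connection pool) per DoFn instance, roughly one per worker thread.
        client = HttpClient.newHttpClient();
      }

      @ProcessElement
      public void processElement(@Element String element, OutputReceiver<String> out) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://example.com/enrich?key=" + element)) // hypothetical endpoint
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        out.output(response.body());
      }
    }

With a pattern like this, the number of live connections per worker tracks the number of DoFn instances the runner decides to run, rather than the number of elements processed.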

Beam / Dataflow Custom Python job - Cloud Storage to PubSub

Submitted by 走远了吗. on 2019-12-10 10:25:43
Question: I need to perform a very simple transformation on some data (extract a string from JSON), then write it to PubSub. I'm attempting to use a custom Python Dataflow job to do so. I've written a job which successfully writes back to Cloud Storage, but my attempts at even the simplest possible write to PubSub (no transformation) result in an error: JOB_MESSAGE_ERROR: Workflow failed. Causes: Expected custom source to have non-zero number of splits. Has anyone successfully written to PubSub from…

BigQueryIO Read performance using withTemplateCompatibility

Submitted by 烈酒焚心 on 2019-12-10 07:04:28
Question: Apache Beam 2.1.0 had a bug with template pipelines that read from BigQuery, which meant they could only be executed once; more details here: https://issues.apache.org/jira/browse/BEAM-2058. This was fixed with the release of Beam 2.2.0: by reading from BigQuery with the withTemplateCompatibility option, your template pipeline can now be run multiple times.

    pipeline
        .apply("Read rows from table.", BigQueryIO.readTableRows()
            .withTemplateCompatibility()
            .from("<your-table>")…

Maven conflict in Java app with google-cloud-core-grpc dependency

Submitted by 回眸只為那壹抹淺笑 on 2019-12-09 11:17:26
Question: (I've also raised a GitHub issue for this - https://github.com/googleapis/google-cloud-java/issues/4095.) I have the latest versions of the following two dependencies for Apache Beam:

Dependency 1 - google-cloud-dataflow-java-sdk-all (a distribution of Apache Beam designed to simplify usage of Apache Beam on the Google Cloud Dataflow service - https://mvnrepository.com/artifact/com.google.cloud.dataflow/google-cloud-dataflow-java-sdk-all):

    <dependency>
      <groupId>com.google.cloud.dataflow</groupId>…

Dataflow Pipeline - “Processing stuck in step <STEP_NAME> for at least <TIME> without outputting or completing in state finish…”

Submitted by 霸气de小男生 on 2019-12-09 04:46:22
Question: The Dataflow pipelines developed by my team suddenly started getting stuck and stopped processing our events. Their worker logs filled with warning messages saying that one specific step got stuck. The peculiar thing is that the failing steps differ: one is a BigQuery output and another a Cloud Storage output. The following are the log messages we are receiving. For BigQuery output: Processing stuck in step <STEP_NAME>/StreamingInserts/StreamingWriteTables…

Singleton in Google Dataflow

Submitted by 瘦欲@ on 2019-12-09 04:04:27
Question: I have a Dataflow pipeline which reads messages from PubSub. I need to enrich each message using a couple of APIs, and I want a single instance of each API client to be used for processing all records, to avoid initializing it for every request. I tried creating a static variable, but I still see the client being initialized many times. How can I avoid initializing a variable multiple times in Google Dataflow?

Answer (Pablo): Dataflow uses multiple machines in parallel to do data analysis, so your API client will have to be initialized at least once per machine. In fact, Dataflow does not have strong guarantees on the…
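A common way to get at-most-once initialization per worker JVM (a sketch, not part of the answer above, assuming a hypothetical thread-safe ApiClient class) is to keep the client in a static field with lazy, synchronized initialization; static fields are never serialized with the DoFn, and every DoFn instance on the same worker shares the one client:

    import org.apache.beam.sdk.transforms.DoFn;

    public class EnrichFn extends DoFn<String, String> {
      // One client per worker JVM, shared by every DoFn instance and thread on that worker.
      private static ApiClient client; // ApiClient is a hypothetical, thread-safe client

      private static synchronized ApiClient getClient() {
        if (client == null) {
          client = new ApiClient(); // expensive initialization happens at most once per JVM
        }
        return client;
      }

      @ProcessElement
      public void processElement(@Element String msg, OutputReceiver<String> out) {
        out.output(getClient().enrich(msg));
      }
    }

As the answer notes, this still means one initialization per machine (and per JVM restart), which is usually the best you can do on Dataflow.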

Beam/Dataflow 2.2.0 - extract first n elements from pcollection

Submitted by 我只是一个虾纸丫 on 2019-12-08 14:30:23
Question: Is there any way to extract the first n elements of a Beam PCollection? The documentation doesn't seem to indicate any such function. I think such an operation would require a global element-number assignment first and then a filter; it would be nice to have this functionality. I use the Google Dataflow Java SDK 2.2.0.

Answer 1: PCollections are unordered per se, so the notion of "first N elements" does not exist. However, in case you need the top N elements by some criterion, you can use the Top…
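To illustrate the Top transform named in the answer (and Sample.any as an additional option for when any N elements will do), here is a self-contained sketch that is not part of the original answer:

    import java.util.List;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Sample;
    import org.apache.beam.sdk.transforms.Top;
    import org.apache.beam.sdk.values.PCollection;

    public class TopNExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        PCollection<Long> numbers = p.apply(Create.of(5L, 3L, 9L, 1L, 7L));

        // Top N by natural order: one output element holding a List of the 3 largest values.
        PCollection<List<Long>> top3 = numbers.apply(Top.largest(3));

        // Any N elements, for when "first" really just means "some": up to 3 arbitrary elements.
        PCollection<Long> any3 = numbers.apply(Sample.any(3));

        p.run().waitUntilFinish();
      }
    }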

Check if PCollection is empty - Apache Beam

Submitted by 感情迁移 on 2019-12-08 13:29:20
Question: Is there any way to check if a PCollection is empty? I haven't found anything relevant in the documentation of Dataflow or Apache Beam.

Answer 1: There is no way to check the size of a PCollection without applying a PTransform to it (such as Count.globally() or Combine.combineFn()), because a PCollection is not like a typical Collection in the Java SDK. It is an abstraction over a bounded or unbounded collection of data, where data is fed into the collection for an operation to be applied to it (e.g.…
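As a sketch of the Count.globally() route the answer mentions (the downstream logging DoFn is illustrative only, and `input` is assumed to be an existing PCollection<String>; none of this is from the original answer), the emptiness check itself becomes an element flowing through the pipeline:

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    // Count every element; the result is a PCollection with a single Long element.
    PCollection<Long> size = input.apply(Count.globally());

    // React to the count inside the pipeline, since there is no way to inspect it directly.
    size.apply(ParDo.of(new DoFn<Long, Void>() {
      @ProcessElement
      public void processElement(@Element Long count) {
        if (count == 0L) {
          System.out.println("PCollection is empty"); // or emit to a side output, write a marker file, etc.
        }
      }
    }));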

Tensorflow transform on beams with flink runner

Submitted by 戏子无情 on 2019-12-08 13:08:32
It may seem stupid, but this is my very first post here; sorry if I do anything wrong. I am currently building a simple ML pipeline with TFX 0.11 (i.e. tfdv-tft-tfserving) and TensorFlow 1.11, using Python 2.7. I have an Apache Flink cluster and I want to use it for TFX. I know the framework behind TFX is Apache Beam 2.8, and Beam currently supports Flink with the Python SDK through a portable runner layer. But the problem is how I can code in TFX (tfdv-tft) using Apache Beam with the Flink runner through this portable runner concept, as TFX currently seems to only support…

How to create a personalised WindowFn in google dataflow

Submitted by 落花浮王杯 on 2019-12-08 12:25:15
Question: I'd like to create a different WindowFn in such a way that it assigns windows to my input elements based on another field instead of the input entry's timestamp. I know the pre-defined WindowFns from the Google Dataflow SDK use the timestamp as the criterion to assign windows. More specifically, I'd like to create a kind of SlidingWindows, but instead of using the timestamp as the window-assignment criterion, I'd like to use another field. How could I create my…
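One way to get that behavior without writing a full custom WindowFn is to first move the field of interest into the element's timestamp and then reuse the built-in SlidingWindows. This is a workaround rather than a new WindowFn, sketched below with the Java SDK and a hypothetical Event type carrying a getEventTimeMillis() field (neither is from the original question):

    import org.apache.beam.sdk.transforms.WithTimestamps;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    // `events` is assumed to be an existing PCollection<Event>.
    PCollection<Event> windowed = events
        // Re-timestamp each element from the field you actually want to window on.
        .apply(WithTimestamps.of((Event e) -> new Instant(e.getEventTimeMillis())))
        // The standard sliding windows now group by that field.
        .apply(Window.into(SlidingWindows.of(Duration.standardMinutes(10))
                                         .every(Duration.standardMinutes(1))));

Note that WithTimestamps rejects timestamps that move an element backwards in time beyond the allowed skew, so this fits best when the chosen field is close to, or ahead of, the element's original timestamp.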