apache-beam

Apache Beam WordCount running error in Windows

纵然是瞬间 submitted on 2019-12-12 16:56:40
Question: Trying to run the WordCount example of Apache Beam (version 2.0.0) by first running $ mvn archetype:generate \ -DarchetypeGroupId=org.apache.beam \ -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \ -DarchetypeVersion=2.0.0 \ -DgroupId=org.example \ -DartifactId=word-count-beam \ -Dversion="0.1" \ -Dpackage=org.apache.beam.examples \ -DinteractiveMode=false and then running $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=pom.xml -
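
The backslash line continuations above are Bash syntax; in the Windows cmd.exe shell they are passed through literally, which breaks the archetype invocation. A sketch of the same command reformatted for cmd.exe, with identical parameters and caret continuations:

    mvn archetype:generate ^
      -DarchetypeGroupId=org.apache.beam ^
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples ^
      -DarchetypeVersion=2.0.0 ^
      -DgroupId=org.example ^
      -DartifactId=word-count-beam ^
      -Dversion="0.1" ^
      -Dpackage=org.apache.beam.examples ^
      -DinteractiveMode=false

The exec:java invocation needs the same treatment: run it on a single line (or with ^ continuations), keeping the whole -Dexec.args value inside one pair of quotes.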

HTTP Client in DoFn

一个人想着一个人 submitted on 2019-12-12 13:25:19
Question: I would like to make POST requests through a DoFn for an Apache Beam pipeline running on Dataflow. For that, I have created a client which instantiates a CloseableHttpClient configured on a PoolingHttpClientConnectionManager. However, I instantiate a client for each element that I process. How could I set up a persistent client used by all my elements? And is there another class for parallel and high-speed HTTP requests that I should use? Answer 1: You can put the client into a member variable, use the
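
The answer is cut off above; a minimal sketch of the pattern it starts to describe, assuming Apache HttpClient 4.x (the PostFn name and pool size are illustrative): build the client once per DoFn instance in @Setup, reuse it for every element, and close it in @Teardown.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class PostFn extends DoFn<String, String> {
      // transient: the client is not serialized with the DoFn, it is rebuilt on each worker
      private transient CloseableHttpClient client;

      @Setup
      public void setup() {
        // Runs once per DoFn instance, so the connection pool is shared
        // by all bundles and elements this instance processes.
        client = HttpClients.custom().setMaxConnTotal(50).build();
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        // Issue the POST with the pooled `client` here and emit the response.
      }

      @Teardown
      public void teardown() throws Exception {
        if (client != null) {
          client.close();
        }
      }
    }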

Failed to construct instance from factory method DataflowRunner#fromOptions in beamSql, Apache Beam

ぃ、小莉子 submitted on 2019-12-12 13:18:19
Question: I'm specifying the Dataflow runner in my beamSql program below: DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class); options.setStagingLocation("gs://gcpbucket/staging"); options.setTempLocation("gs://gcpbucket/tmp"); options.setProject("beta-19xxxx"); options.setRunner(DataflowRunner.class); Pipeline p = Pipeline.create(options); But I'm getting the exception below: Exception in thread "main" java.lang.RuntimeException: Failed to construct instance from
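
The stack trace is truncated above, but DataflowRunner.fromOptions validates the options before the runner is constructed, and a frequent trigger for this exception is the gcpTempLocation check (gcpTempLocation is derived from tempLocation only when the latter is a valid, accessible GCS path and the credentials allow it). A sketch of the same setup with gcpTempLocation set explicitly, bucket paths as in the question:

    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setProject("beta-19xxxx");
    options.setRunner(DataflowRunner.class);
    options.setStagingLocation("gs://gcpbucket/staging");
    options.setTempLocation("gs://gcpbucket/tmp");
    // Setting gcpTempLocation explicitly sidesteps the implicit derivation,
    // which fails when tempLocation or the credentials cannot be validated.
    options.setGcpTempLocation("gs://gcpbucket/tmp");
    Pipeline p = Pipeline.create(options);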

Python Apache Beam Side Input Assertion Error

懵懂的女人 submitted on 2019-12-12 10:07:25
Question: I am still new to Apache Beam/Cloud Dataflow, so I apologize if my understanding is not correct. I am trying to read a data file, ~30,000 rows long, through a pipeline. My simple pipeline first opened the CSV from GCS, pulled the headers out of the data, ran the data through a ParDo/DoFn function, and then wrote all of the output into a CSV back in GCS. This pipeline worked and was my first test. I then edited the pipeline to read the CSV, pull out the headers, remove the headers from the

Apply TensorFlow Transform to transform/scale features in production

守給你的承諾、 submitted on 2019-12-12 09:36:22
Question: Overview: I followed the following guide to write TF Records, where I used tf.Transform to preprocess my features. Now, I would like to deploy my model, for which I need to apply this preprocessing function on real live data. My Approach: First, suppose I have 2 features: features = ['amount', 'age']. I have the transform_fn from the Apache Beam job, residing in working_dir=gs://path-to-transform-fn/. Then I load the transform function using: tf_transform_output = tft.TFTransformOutput(working_dir) I

Can I use setWorkerCacheMb in Apache Beam 2.0+?

為{幸葍}努か submitted on 2019-12-12 05:37:42
Question: My Dataflow job (using Java SDK 2.1.0) is quite slow, and it is going to take more than a day to process just 50 GB. I just pull a whole table from BigQuery (50 GB) and join it with one CSV file from GCS (100+ MB). https://cloud.google.com/dataflow/model/group-by-key I use sideInputs to perform the join (the latter way in the documentation above), while I think using CoGroupByKey is more efficient; however, I'm not sure that is the only reason my job is super slow. I googled, and it looks like by default, a cache
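
The question is truncated above; the cache it refers to is the Dataflow worker's side-input cache, which defaults to 100 MB, and a side input larger than the cache is re-fetched repeatedly, which can make sideInput-based joins very slow. In the Beam 2.x Java SDK the setting still exists on DataflowWorkerHarnessOptions; a sketch, with an illustrative 500 MB value:

    import org.apache.beam.runners.dataflow.options.DataflowWorkerHarnessOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Inside main(String[] args):
    DataflowWorkerHarnessOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowWorkerHarnessOptions.class);
    // Default is 100 MB; raise it so the 100+ MB side input fits in cache.
    options.setWorkerCacheMb(500);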

Unable to run multiple pipelines in the desired order by creating a template in Apache Beam

血红的双手。 submitted on 2019-12-12 04:46:02
Question: I have two separate pipelines, say 'P1' and 'P2'. As per my requirement, I need to run P2 only after P1 has completely finished its execution. I need to get this entire operation done through a single template. Basically, the template gets created the moment run() is encountered, say p1.run(). So from what I can see, I would need to handle the two different pipelines using two different templates, but that would not satisfy my strict order-based pipeline execution requirement. Another way I could think of
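
A template captures exactly one pipeline graph, so the ordering cannot be expressed inside the template itself; it has to live in whatever launches the pipelines. A minimal sketch of strict ordering in the launcher:

    import org.apache.beam.sdk.PipelineResult;

    PipelineResult firstResult = p1.run();
    // Block until P1 reaches a terminal state before even submitting P2.
    firstResult.waitUntilFinish();
    if (firstResult.getState() == PipelineResult.State.DONE) {
      p2.run().waitUntilFinish();
    }

In SDK versions that have it, the Wait.on transform is an in-pipeline alternative: it gates one branch on the completion of another inside a single pipeline, and hence a single template.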

Extract value from ValueProvider in Apache Beam

左心房为你撑大大i submitted on 2019-12-12 04:37:59
Question: I have a runtime value that I'm getting in my Apache Beam program. I need to access that value, but Beam does not allow me to read it unless I'm reading it from within a transform like ParDo. If I try to access that value outside any transform, it gives me an error saying: "Not called from a runtime context". How do I read such values? P.S. I'm using a template of the program. Answer 1: A program (such as a template) is executed in two stages. In the first, the main method is evaluated to
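
The answer above is cut off; the practical consequence is that a ValueProvider can be passed around freely at graph-construction time, but its get() may only be called at runtime, inside a transform. A sketch, where the inputPath option is hypothetical:

    import org.apache.beam.sdk.options.ValueProvider;
    import org.apache.beam.sdk.transforms.DoFn;

    public class UseValueFn extends DoFn<String, String> {
      private final ValueProvider<String> inputPath;

      public UseValueFn(ValueProvider<String> inputPath) {
        // The provider itself is serializable; capturing it at construction time is fine.
        this.inputPath = inputPath;
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        // get() is legal here because processElement runs in a runtime context;
        // calling it in main(), before run(), throws "Not called from a runtime context".
        c.output(inputPath.get());
      }
    }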

Pick elements in processElement() - Apache Beam

ε祈祈猫儿з submitted on 2019-12-12 04:08:03
Question: I know that when we implement a ParDo transform, we pick up individual elements from our data (basically separated by "\n"). But what if I have an element that occupies two lines in my file? Can I apply my own condition to pick elements according to it? Or is it always necessary to have an element on a single line? Answer 1: Reading of text files is controlled by TextIO, not by ParDo - I suppose that's what you meant. Indeed, right now TextIO splits files into 1 element per line; however, there is
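
The answer is truncated above; in SDK versions newer than the one it was written against, TextIO gained a withDelimiter option, so multi-line records can be read directly. A sketch, assuming records are separated by a blank line (the file path and delimiter are illustrative):

    import java.nio.charset.StandardCharsets;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.values.PCollection;

    // Each element is now everything up to the next blank line,
    // rather than a single "\n"-terminated line.
    PCollection<String> records = p.apply(
        TextIO.read()
            .from("gs://my-bucket/input.txt")
            .withDelimiter("\n\n".getBytes(StandardCharsets.UTF_8)));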

Dataflow Java SDK: Take the code as an input, process at the backend

寵の児 submitted on 2019-12-12 02:53:57
Question: Please help me understand the implementation of the following scenario. Suppose the user types code written using Dataflow SDK commands into a text box at the front end. We need to get that code (let's say as a string) and execute it at the back end. Does the Dataflow SDK provide a facility, like an execution manager, to do such a thing? Also, some resources to get familiar with such an implementation would be much appreciated. Thanks. Answer 1: Dataflow does not support the kind of dynamic evaluation