apache-beam

How to match multiple files with names using TextIO.Read in Cloud Dataflow

橙三吉。 submitted on 2019-12-23 18:28:01

Question: I have a GCS folder laid out as below:

    gs://<bucket-name>/<folder-name>/dt=2017-12-01/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-02/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-03/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-04/part-0000.tsv
    ...

I want to match only the files under dt=2017-12-02 and dt=2017-12-03 using sc.textFile() in Scio, which, as far as I know, uses TextIO.Read.from() underneath. I've tried

    gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv

and

    gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv

Both match zero files.
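
When a brace or alternation glob is not matched by the GCS filesystem, a common fallback is to issue one read per partition and flatten the results into a single collection. A minimal sketch with the Beam Java SDK (the same idea applies in Scio by reading each path separately and unioning the resulting SCollections); bucket and folder names are the placeholders from the question:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    public class ReadTwoPartitions {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // One plain glob per date partition; each is a pattern the GCS matcher accepts.
            PCollection<String> day02 = p.apply("ReadDec02",
                TextIO.read().from("gs://<bucket-name>/<folder-name>/dt=2017-12-02/*.tsv"));
            PCollection<String> day03 = p.apply("ReadDec03",
                TextIO.read().from("gs://<bucket-name>/<folder-name>/dt=2017-12-03/*.tsv"));

            // Flatten both reads into one PCollection, equivalent to matching both folders at once.
            PCollectionList.of(day02).and(day03).apply(Flatten.pCollections());

            p.run().waitUntilFinish();
        }
    }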

How to get PipelineOptions in composite PTransform in Beam 2.0?

纵饮孤独 submitted on 2019-12-23 15:25:11

Question: After upgrading to Beam 2.0, the Pipeline class no longer has a getOptions() method. I have a composite PTransform that relies on getting the options in its expand method:

    public class MyCompositeTransform extends PTransform<PBegin, PDone> {
        @Override
        public PDone expand(PBegin input) {
            Pipeline pipeline = input.getPipeline();
            MyPipelineOptions options = pipeline.getOptions().as(MyPipelineOptions.class);
            ...
        }
    }

In Beam 2.0 there doesn't seem to be a way to access the PipelineOptions from within expand().
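
Since Beam 2.0 the usual workarounds are to pass the required values (rather than the whole options object) into the transform's constructor, and to read options at runtime inside a DoFn via its process context. A rough sketch reusing the names from the question; getSomeSetting() is a hypothetical getter on MyPipelineOptions:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PBegin;
    import org.apache.beam.sdk.values.PDone;

    public class MyCompositeTransform extends PTransform<PBegin, PDone> {
        // Keep only serializable values extracted from the options at construction time.
        private final String someSetting;

        public MyCompositeTransform(MyPipelineOptions options) {
            this.someSetting = options.getSomeSetting();  // hypothetical getter
        }

        @Override
        public PDone expand(PBegin input) {
            // ... build the expansion using someSetting ...
            return PDone.in(input.getPipeline());
        }
    }

    class MyFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Options remain reachable at runtime through the process context.
            MyPipelineOptions opts = c.getPipelineOptions().as(MyPipelineOptions.class);
        }
    }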

Session windows in Apache Beam with python

狂风中的少年 submitted on 2019-12-23 05:11:36

Question: I have a stream of user events. I've mapped them into KV{ userId, event } and assigned timestamps. This is to run in streaming mode. I would like to be able to produce the following input-output result with a session window gap of 1:

    input:  user=1, timestamp=1, event=a
    input:  user=2, timestamp=2, event=a
    input:  user=2, timestamp=3, event=a
    input:  user=1, timestamp=2, event=b
    time:   lwm=3
    output: user=1, [ { event=a, timestamp=1 }, { event=b, timestamp=2 } ]
    time:   lwm=4
    output: user=2, [ { event=a, timestamp=2 }, { event=a, timestamp=3 } ]
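
The grouping described above maps onto Beam's Sessions windowing followed by a GroupByKey, with the default trigger emitting each per-key session once the watermark passes the end of that session. A minimal sketch in the Java SDK (the Python SDK's Sessions window is the analogous construct); the 1-second gap is an assumption standing in for gap=1:

    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.windowing.Sessions;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    class Sessionize {
        // events: KV<userId, event> with timestamps already assigned upstream.
        static PCollection<KV<String, Iterable<String>>> apply(PCollection<KV<String, String>> events) {
            return events
                // A session for a key closes once no event arrives for the gap duration.
                .apply(Window.<KV<String, String>>into(
                    Sessions.withGapDuration(Duration.standardSeconds(1))))
                // With the default trigger, each session fires when the watermark passes
                // its end, which matches the lwm=3 / lwm=4 outputs in the example.
                .apply(GroupByKey.create());
        }
    }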

Parallelism Problem on Cloud Dataflow using Go SDK

北城余情 submitted on 2019-12-23 04:08:12

Question: I have an Apache Beam pipeline implemented with the Go SDK as described below. The pipeline has three steps: one is textio.Read, another is CountLines, and the last is ProcessLines. The ProcessLines step takes around 10 seconds; I just added a Sleep function for the sake of brevity. I am calling the pipeline with 20 workers. When I run the pipeline, my expectation was that 20 workers would run in parallel, textio.Read would read 20 lines from the file, and ProcessLines would do 20 parallel executions in 10 seconds.
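
One common reason for this on Dataflow is fusion: the read and the slow ParDo get fused into a single stage, so elements are processed within the few bundles produced by the read rather than being spread across all 20 workers. Breaking fusion with a shuffle between the read and the expensive step lets the runner redistribute elements. A sketch of that technique in the Beam Java SDK (shown here as the general idea, not the Go SDK API):

    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.PCollection;

    class BreakFusion {
        // lines: the output of the text read; the ~10s-per-element step is applied afterwards.
        static PCollection<String> apply(PCollection<String> lines) {
            // Reshuffle materializes and redistributes the elements, so the slow downstream
            // ParDo is no longer fused to the read and can occupy every worker.
            return lines.apply(Reshuffle.viaRandomKey());
        }
    }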

Apache Beam: Trigger for Fixed Window

岁酱吖の submitted on 2019-12-23 01:56:11

Question: According to the following documentation, if you don't explicitly specify a trigger you get the behavior described below: "If unspecified, the default behavior is to trigger first when the watermark passes the end of the window, and then trigger again every time there is late arriving data." Is this behavior true for FixedWindow as well? For example, you would assume a fixed window should have a default trigger of repeatedly firing after the watermark passes the end of the window, and discard all late data.
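
For a FixedWindows PCollection the default behaves as described, with the important caveat that the default allowed lateness is zero, so in practice late elements are simply dropped unless you raise it. Spelling the default out explicitly looks roughly like the sketch below (the window size and allowed lateness are example values, not defaults):

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    class ExplicitDefaultTrigger {
        static <T> PCollection<T> apply(PCollection<T> input) {
            return input.apply(Window.<T>into(FixedWindows.of(Duration.standardMinutes(1)))
                // Fire once when the watermark passes the end of the window...
                .triggering(AfterWatermark.pastEndOfWindow()
                    // ...then once more for each late element arriving within the allowed lateness.
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(Duration.standardMinutes(10))
                // Each pane contains only the new elements since the previous firing.
                .discardingFiredPanes());
        }
    }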

Pre Processing Data for Tensorflow: InvalidArgumentError

▼魔方 西西 submitted on 2019-12-23 01:37:38

Question: When I run my TensorFlow model I receive this error:

    InvalidArgumentError: Field 4 in record 0 is not a valid float: latency
    [[Node: DecodeCSV = DecodeCSV[OUT_TYPE=[DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING], field_delim=",", na_value="", use_quote_delim=true](arg0, DecodeCSV/record_defaults_0, DecodeCSV/record_defaults_1, DecodeCSV/record_defaults_2, DecodeCSV/record_defaults_3,

Singleton in Google Dataflow

拥有回忆 submitted on 2019-12-23 01:24:46

Question: I have a Dataflow pipeline which reads messages from Pub/Sub. I need to enrich each message using a couple of APIs. I want a single instance of the API client to be used for processing all records, to avoid initializing the API for every request. I tried creating a static variable, but I still see the API being initialized many times. How do I avoid initializing a variable multiple times in Google Dataflow?

Answer 1: Dataflow uses multiple machines in parallel to do data analysis, so your API will have to be initialized on each worker.
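
Because each worker (and each DoFn instance) lives in its own JVM, a true job-wide singleton is not possible; the practical pattern is to create the client once per DoFn instance in a @Setup method (or lazily behind a static holder) so it is reused across elements and bundles rather than rebuilt per request. A sketch, where ApiClient and its methods are placeholders for the actual enrichment API:

    import org.apache.beam.sdk.transforms.DoFn;

    class EnrichFn extends DoFn<String, String> {
        // One client per DoFn instance, i.e. a handful per worker rather than one per element.
        private transient ApiClient client;  // ApiClient is a placeholder type

        @Setup
        public void setup() {
            client = ApiClient.create();  // hypothetical factory; runs once per DoFn instance
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
            c.output(client.enrich(c.element()));  // hypothetical enrich() call
        }

        @Teardown
        public void teardown() {
            if (client != null) {
                client.close();  // hypothetical cleanup
            }
        }
    }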

Tensorflow transform on beams with flink runner

£可爱£侵袭症+ submitted on 2019-12-23 01:24:34

Question: It may seem stupid, but this is my very first post here, so sorry if I'm doing anything wrong. I am currently building a simple ML pipeline with TFX 0.11 (i.e. tfdv-tft-tfserving) and TensorFlow 1.11, using Python 2.7. I have an Apache Flink cluster and I want to use it for TFX. I know the framework behind TFX is Apache Beam 2.8, and that Apache Beam currently supports Flink from the Python SDK through a portable runner layer. But the problem is how I can code in TFX (tfdv-tft) so that it uses Apache Beam with the Flink runner.

A way to execute pipeline periodically from bounded source in Apache Beam

家住魔仙堡 submitted on 2019-12-22 17:55:55

Question: I have a pipeline that takes data from a MySQL server and inserts it into Datastore using the Dataflow runner. It works fine as a batch job executed once. The thing is that I want to get new data from the MySQL server into Datastore in near real time, but JdbcIO gives bounded data as a source (since it is the result of a query), so my pipeline executes only once. Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds? Or is there a way to make the pipeline redo the query periodically?
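
Short of re-submitting the batch job on a schedule (for example from cron or a template launcher), one pattern is to drive the query from an unbounded ticking source: GenerateSequence can emit one element every N seconds in streaming mode, and a ParDo re-runs the MySQL query on each tick. A rough sketch; the JDBC and Datastore details are placeholders:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.joda.time.Duration;

    public class PeriodicPoll {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create();

            p.apply("TickEvery30s",
                    GenerateSequence.from(0).withRate(1, Duration.standardSeconds(30)))
             .apply("PollMySQL", ParDo.of(new DoFn<Long, String>() {
                 @ProcessElement
                 public void processElement(ProcessContext c) {
                     // Placeholder: open a JDBC connection, fetch rows added since the last tick,
                     // and emit them downstream (e.g. toward a Datastore write).
                 }
             }));

            p.run();
        }
    }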