apache-beam

How to match multiple files with names using TextIO.Read in Cloud Dataflow

橙三吉。 submitted on 2019-12-23 18:28:01

Question: I have a GCS folder laid out as below:

    gs://<bucket-name>/<folder-name>/dt=2017-12-01/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-02/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-03/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-04/part-0000.tsv
    ...

I want to match only the files under dt=2017-12-02 and dt=2017-12-03 using sc.textFile() in Scio, which, as far as I know, uses TextIO.Read.from() underneath. I've tried

    gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv

and

    gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv

Both match zero files.
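
When a brace or alternation glob is not matched by the GCS filesystem, a common fallback is to issue one read per partition and flatten the results into a single collection. A minimal sketch with the Beam Java SDK (the same idea applies in Scio by reading each path separately and unioning the resulting SCollections); bucket and folder names are the placeholders from the question:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    public class ReadTwoPartitions {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // One plain glob per date partition; each is a pattern the GCS matcher accepts.
            PCollection<String> day02 = p.apply("ReadDec02",
                TextIO.read().from("gs://<bucket-name>/<folder-name>/dt=2017-12-02/*.tsv"));
            PCollection<String> day03 = p.apply("ReadDec03",
                TextIO.read().from("gs://<bucket-name>/<folder-name>/dt=2017-12-03/*.tsv"));

            // Flatten both reads into one PCollection, equivalent to matching both folders at once.
            PCollectionList.of(day02).and(day03).apply(Flatten.pCollections());

            p.run().waitUntilFinish();
        }
    }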

How to get PipelineOptions in composite PTransform in Beam 2.0?

纵饮孤独 submitted on 2019-12-23 15:25:11

Question: After upgrading to Beam 2.0, the Pipeline class no longer has a getOptions() method. I have a composite PTransform that relies on getting the options in its expand method:

    public class MyCompositeTransform extends PTransform<PBegin, PDone> {
        @Override
        public PDone expand(PBegin input) {
            Pipeline pipeline = input.getPipeline();
            MyPipelineOptions options = pipeline.getOptions().as(MyPipelineOptions.class);
            ...
        }
    }

In Beam 2.0 there doesn't seem to be a way to access the PipelineOptions from within expand().
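
Since Beam 2.0 the usual workarounds are to pass the required values (rather than the whole options object) into the transform's constructor, and to read options at runtime inside a DoFn via its process context. A rough sketch reusing the names from the question; getSomeSetting() is a hypothetical getter on MyPipelineOptions:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PBegin;
    import org.apache.beam.sdk.values.PDone;

    public class MyCompositeTransform extends PTransform<PBegin, PDone> {
        // Keep only serializable values extracted from the options at construction time.
        private final String someSetting;

        public MyCompositeTransform(MyPipelineOptions options) {
            this.someSetting = options.getSomeSetting();  // hypothetical getter
        }

        @Override
        public PDone expand(PBegin input) {
            // ... build the expansion using someSetting ...
            return PDone.in(input.getPipeline());
        }
    }

    class MyFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Options remain reachable at runtime through the process context.
            MyPipelineOptions opts = c.getPipelineOptions().as(MyPipelineOptions.class);
        }
    }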

Session windows in Apache Beam with python

狂风中的少年 submitted on 2019-12-23 05:11:36

Question: I have a stream of user events. I've mapped them into KV{ userId, event } and assigned timestamps. This is to run in streaming mode. I would like to be able to produce the following input-output result with a session window gap of 1:

    input:  user=1, timestamp=1, event=a
    input:  user=2, timestamp=2, event=a
    input:  user=2, timestamp=3, event=a
    input:  user=1, timestamp=2, event=b
    time:   lwm=3
    output: user=1, [ { event=a, timestamp=1 }, { event=b, timestamp=2 } ]
    time:   lwm=4
    output: user=2, [ { event=a, timestamp=2 }, { event=a, timestamp=3 } ]
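
The grouping described above maps onto Beam's Sessions windowing followed by a GroupByKey, with the default trigger emitting each per-key session once the watermark passes the end of that session. A minimal sketch in the Java SDK (the Python SDK's Sessions window is the analogous construct); the 1-second gap is an assumption standing in for gap=1:

    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.windowing.Sessions;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    class Sessionize {
        // events: KV<userId, event> with timestamps already assigned upstream.
        static PCollection<KV<String, Iterable<String>>> apply(PCollection<KV<String, String>> events) {
            return events
                // A session for a key closes once no event arrives for the gap duration.
                .apply(Window.<KV<String, String>>into(
                    Sessions.withGapDuration(Duration.standardSeconds(1))))
                // With the default trigger, each session fires when the watermark passes
                // its end, which matches the lwm=3 / lwm=4 outputs in the example.
                .apply(GroupByKey.create());
        }
    }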

Parallelism Problem on Cloud Dataflow using Go SDK

北城余情 submitted on 2019-12-23 04:08:12

Question: I have an Apache Beam pipeline implemented with the Go SDK as described below. The pipeline has three steps: one is textio.Read, another is CountLines, and the last is ProcessLines. The ProcessLines step takes around 10 seconds; I just added a Sleep function for the sake of brevity. I am calling the pipeline with 20 workers. When I run the pipeline, my expectation was that 20 workers would run in parallel, textio.Read would read 20 lines from the file, and ProcessLines would do 20 parallel executions in 10 seconds.
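
One common reason for this on Dataflow is fusion: the read and the slow ParDo get fused into a single stage, so elements are processed within the few bundles produced by the read rather than being spread across all 20 workers. Breaking fusion with a shuffle between the read and the expensive step lets the runner redistribute elements. A sketch of that technique in the Beam Java SDK (shown here as the general idea, not the Go SDK API):

    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.PCollection;

    class BreakFusion {
        // lines: the output of the text read; the ~10s-per-element step is applied afterwards.
        static PCollection<String> apply(PCollection<String> lines) {
            // Reshuffle materializes and redistributes the elements, so the slow downstream
            // ParDo is no longer fused to the read and can occupy every worker.
            return lines.apply(Reshuffle.viaRandomKey());
        }
    }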

Apache Beam: Trigger for Fixed Window

岁酱吖の submitted on 2019-12-23 01:56:11

Question: According to the following documentation, if you don't explicitly specify a trigger you get the behavior described below: "If unspecified, the default behavior is to trigger first when the watermark passes the end of the window, and then trigger again every time there is late arriving data." Is this behavior true for FixedWindow as well? For example, you would assume a fixed window should have a default trigger of repeatedly firing after the watermark passes the end of the window, and discard all late data.
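
For a FixedWindows PCollection the default behaves as described, with the important caveat that the default allowed lateness is zero, so in practice late elements are simply dropped unless you raise it. Spelling the default out explicitly looks roughly like the sketch below (the window size and allowed lateness are example values, not defaults):

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    class ExplicitDefaultTrigger {
        static <T> PCollection<T> apply(PCollection<T> input) {
            return input.apply(Window.<T>into(FixedWindows.of(Duration.standardMinutes(1)))
                // Fire once when the watermark passes the end of the window...
                .triggering(AfterWatermark.pastEndOfWindow()
                    // ...then once more for each late element arriving within the allowed lateness.
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(Duration.standardMinutes(10))
                // Each pane contains only the new elements since the previous firing.
                .discardingFiredPanes());
        }
    }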

Pre Processing Data for Tensorflow: InvalidArgumentError

▼魔方 西西 submitted on 2019-12-23 01:37:38

Question: When I run my TensorFlow model I receive this error:

    InvalidArgumentError: Field 4 in record 0 is not a valid float: latency
    [[Node: DecodeCSV = DecodeCSV[OUT_TYPE=[DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING], field_delim=",", na_value="", use_quote_delim=true](arg0, DecodeCSV/record_defaults_0, DecodeCSV/record_defaults_1, DecodeCSV/record_defaults_2, DecodeCSV/record_defaults_3,

Singleton in Google Dataflow

拥有回忆 submitted on 2019-12-23 01:24:46

Question: I have a Dataflow pipeline which reads messages from Pub/Sub. I need to enrich each message using a couple of APIs. I want a single instance of the API client to be used for processing all records, to avoid initializing the API for every request. I tried creating a static variable, but I still see the API being initialized many times. How do I avoid initializing a variable multiple times in Google Dataflow?

Answer 1: Dataflow uses multiple machines in parallel to do data analysis, so your API will have to be initialized on each worker.
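
Because each worker (and each DoFn instance) lives in its own JVM, a true job-wide singleton is not possible; the practical pattern is to create the client once per DoFn instance in a @Setup method (or lazily behind a static holder) so it is reused across elements and bundles rather than rebuilt per request. A sketch, where ApiClient and its methods are placeholders for the actual enrichment API:

    import org.apache.beam.sdk.transforms.DoFn;

    class EnrichFn extends DoFn<String, String> {
        // One client per DoFn instance, i.e. a handful per worker rather than one per element.
        private transient ApiClient client;  // ApiClient is a placeholder type

        @Setup
        public void setup() {
            client = ApiClient.create();  // hypothetical factory; runs once per DoFn instance
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
            c.output(client.enrich(c.element()));  // hypothetical enrich() call
        }

        @Teardown
        public void teardown() {
            if (client != null) {
                client.close();  // hypothetical cleanup
            }
        }
    }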

Tensorflow transform on beams with flink runner

£可爱£侵袭症+ submitted on 2019-12-23 01:24:34

Question: It may seem stupid, but this is my very first post here, so sorry if I'm doing anything wrong. I am currently building a simple ML pipeline with TFX 0.11 (i.e. tfdv-tft-tfserving) and TensorFlow 1.11, using Python 2.7. I have an Apache Flink cluster and I want to use it for TFX. I know the framework behind TFX is Apache Beam 2.8, and that Apache Beam currently supports Flink from the Python SDK through a portable runner layer. But the problem is how I can code in TFX (tfdv-tft) so that it uses Apache Beam with the Flink runner.

A way to execute pipeline periodically from bounded source in Apache Beam

家住魔仙堡 submitted on 2019-12-22 17:55:55

Question: I have a pipeline that takes data from a MySQL server and inserts it into Datastore using the Dataflow runner. It works fine as a batch job executed once. The thing is that I want to get new data from the MySQL server into Datastore in near real time, but JdbcIO gives bounded data as a source (since it is the result of a query), so my pipeline executes only once. Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds? Or is there a way to make the pipeline redo the query periodically?
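
Short of re-submitting the batch job on a schedule (for example from cron or a template launcher), one pattern is to drive the query from an unbounded ticking source: GenerateSequence can emit one element every N seconds in streaming mode, and a ParDo re-runs the MySQL query on each tick. A rough sketch; the JDBC and Datastore details are placeholders:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.joda.time.Duration;

    public class PeriodicPoll {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create();

            p.apply("TickEvery30s",
                    GenerateSequence.from(0).withRate(1, Duration.standardSeconds(30)))
             .apply("PollMySQL", ParDo.of(new DoFn<Long, String>() {
                 @ProcessElement
                 public void processElement(ProcessContext c) {
                     // Placeholder: open a JDBC connection, fetch rows added since the last tick,
                     // and emit them downstream (e.g. toward a Datastore write).
                 }
             }));

            p.run();
        }
    }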