apache-beam

Executing a pipeline only after another one finishes on Google Dataflow

ぃ、小莉子 submitted on 2019-12-07 23:10:17
Question: I want to run a pipeline on Google Dataflow that depends on the output of another pipeline. Right now I am just running the two pipelines one after the other locally with the DirectRunner: with beam.Pipeline(options=pipeline_options) as p: (p | beam.io.ReadFromText(known_args.input) | SomeTransform() | beam.io.WriteToText('temp')) with beam.Pipeline(options=pipeline_options) as p: (p | beam.io.ReadFromText('temp*') | AnotherTransform() | beam.io.WriteToText(known_args.output)) My questions are the
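A minimal sketch of one way to express this dependency: run the first pipeline explicitly, block on wait_until_finish(), and only build the second pipeline afterwards. SomeTransform/AnotherTransform from the question are stood in for by trivial Maps here; the paths mirror the snippet above.

```python
import apache_beam as beam


def run_sequential(input_path, output_path, pipeline_options):
    # First pipeline: write intermediate results to 'temp'.
    p1 = beam.Pipeline(options=pipeline_options)
    (p1
     | 'Read input' >> beam.io.ReadFromText(input_path)
     | 'SomeTransform' >> beam.Map(lambda line: line)    # stand-in for SomeTransform()
     | 'Write temp' >> beam.io.WriteToText('temp'))
    # Block until the first job has finished, so the 'temp*' files exist.
    p1.run().wait_until_finish()

    # Second pipeline: constructed only after the first one completed.
    p2 = beam.Pipeline(options=pipeline_options)
    (p2
     | 'Read temp' >> beam.io.ReadFromText('temp*')
     | 'AnotherTransform' >> beam.Map(lambda line: line)  # stand-in for AnotherTransform()
     | 'Write output' >> beam.io.WriteToText(output_path))
    p2.run().wait_until_finish()
```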

How to retrieve the content of a PCollection and assign it to a normal variable?

人盡茶涼 submitted on 2019-12-07 22:32:35
Question: I am using Apache Beam with the Python SDK. Currently, my pipeline reads multiple files, parses them, and generates pandas dataframes from their data. Then, it groups them into a single dataframe. What I want now is to retrieve this single fat dataframe and assign it to a normal Python variable. Is it possible to do? Answer 1: PCollection is simply a logical node in the execution graph and its contents are not necessarily actually stored anywhere, so this is not possible directly. However, you can ask
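As the (truncated) answer says, a PCollection is only a node in the graph, so there is no direct way to "collect" it. A hedged workaround that relies on the DirectRunner executing in the same process is sketched below: combine everything into a single list and append it to an ordinary Python list as a side effect. The sample data is illustrative.

```python
import apache_beam as beam

collected = []  # the ordinary Python variable we want to fill

with beam.Pipeline() as p:  # DirectRunner by default
    (p
     | 'Create rows' >> beam.Create([{'x': 1}, {'x': 2}, {'x': 3}])
     | 'To one list' >> beam.combiners.ToList()
     | 'Stash locally' >> beam.Map(collected.append))

# The with-block runs the pipeline; because the DirectRunner executed the
# Map in this process, collected[0] now holds all elements and could be
# handed to pandas.DataFrame(...) to build the single fat dataframe.
print(collected[0])
```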

Google Dataflow / Apache Beam Python - Side-Input from PCollection kills performance

泄露秘密 submitted on 2019-12-07 17:40:53
Question: We are running logfile-parsing jobs in Google Dataflow using the Python SDK. Data is spread over several hundred daily logs, which we read via a file pattern from Cloud Storage. Data volume across all files is about 5-8 GB (gz files) with 50-80 million lines in total. loglines = p | ReadFromText('gs://logfile-location/logs*-20180101') In addition, we have a simple (small) mapping csv that maps logfile entries to human-readable text. It has about 400 lines and is 5 kB in size. For example, a logfile entry with
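With a mapping this small (≈400 lines, 5 kB), the usual pattern is to hand it to the workers as a dictionary side input rather than joining two PCollections. A sketch under that assumption; the mapping path and the csv layout are illustrative, not taken from the job.

```python
import apache_beam as beam


def parse_mapping(line):
    # Illustrative: assume the small csv holds "code,human_readable_text" rows.
    code, text = line.split(',', 1)
    return code, text


with beam.Pipeline() as p:
    mapping = (p
               | 'Read mapping' >> beam.io.ReadFromText('gs://logfile-location/mapping.csv')
               | 'Parse mapping' >> beam.Map(parse_mapping))

    loglines = p | 'Read logs' >> beam.io.ReadFromText('gs://logfile-location/logs*-20180101')

    enriched = (loglines
                | 'Enrich' >> beam.Map(
                    # lookup is materialised once per worker as a plain dict.
                    lambda line, lookup: (line, lookup.get(line.split(',')[0], 'unknown')),
                    lookup=beam.pvalue.AsDict(mapping)))
```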

Joining two large-volume PCollections has performance issues

☆樱花仙子☆ submitted on 2019-12-07 15:20:38
Question: Joining two PCollections with the CoGroupByKey approach is taking hours to execute for 8+ million records. Noted from another Stack Overflow post that "CoGbkResult has more than 10000 elements, reiteration (which may be slow) is required." Any suggestion to improve this performance using this approach? Here is the code snippet: PCollection<TableRow> pc1 = ...; PCollection<TableRow> pc2 = ...; WithKeys<String, TableRow>
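The snippet above is Java; for comparison, the same CoGroupByKey join pattern in the Python SDK looks roughly like the sketch below (keys and fields are made up). When one side is much smaller than the other, replacing CoGroupByKey with a dictionary side input, as in the previous example, usually sidesteps the slow CoGbkResult reiteration.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Illustrative stand-ins for the two large keyed TableRow collections.
    pc1 = p | 'pc1' >> beam.Create([('k1', {'a': 1}), ('k2', {'a': 2})])
    pc2 = p | 'pc2' >> beam.Create([('k1', {'b': 10}), ('k3', {'b': 30})])

    joined = (
        {'left': pc1, 'right': pc2}
        | 'CoGroupByKey' >> beam.CoGroupByKey()
        # Each element is (key, {'left': [...], 'right': [...]}); iterating
        # both sides is the part that gets slow for hot keys with very
        # large CoGbkResults.
        | 'Flatten join' >> beam.FlatMap(
            lambda kv: [(kv[0], l, r)
                        for l in kv[1]['left']
                        for r in kv[1]['right']]))
```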

Apache Beam PubSubIO with GroupByKey

╄→гoц情女王★ submitted on 2019-12-07 14:23:05
Question: I'm trying, with Apache Beam 2.1.0, to consume simple (key, value) data from Google Pub/Sub and group by key to be able to treat batches of data. With the default trigger, my code after "GroupByKey" never fires (I waited 30 min). If I define a custom trigger, the code is executed, but I would like to understand why the default trigger never fires. I tried to define my own timestamp with "withTimestampLabel", but same issue. I tried to change the duration of the windows, but same issue too (1 second, 10 seconds, 30 seconds
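The usual explanation is that the default trigger only fires at the end of a window, and an unbounded Pub/Sub source sitting in the global window never reaches one, so the stream has to be windowed (or given a trigger) before the GroupByKey. The question uses the Java SDK; a sketch of the same idea in the Python SDK, applied to an already keyed PCollection, is below.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger


def window_then_group(keyed_messages):
    """keyed_messages: an unbounded (key, value) PCollection read from Pub/Sub."""
    return (keyed_messages
            | 'Window' >> beam.WindowInto(
                window.FixedWindows(10),                        # 10-second windows
                trigger=trigger.AfterWatermark(),               # i.e. the default trigger
                accumulation_mode=trigger.AccumulationMode.DISCARDING)
            # Now "end of window" exists, so the grouping can actually fire.
            | 'GroupByKey' >> beam.GroupByKey())
```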

Apache Beam TextIO glob get original filename

元气小坏坏 submitted on 2019-12-07 12:44:52
Question: I have set up a pipeline. I have to parse hundreds of *.gz files, so a glob works quite well. But I need the original name of the currently processed file, because I want to name the result files after the original files. Can anyone help me here? Here is my code. @Default.String(LOGS_PATH + "*.gz") String getInputFile(); void setInputFile(String value); TextIO.Read read = TextIO.read().withCompressionType(TextIO.CompressionType.GZIP).from(options.getInputFile()); read.getName(); p.apply(
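TextIO.read() deliberately drops the filename, so the common answer is to match the files and read them yourself, keeping the name next to the contents (FileIO.match()/readMatches() in the Java SDK used above). A Python-SDK sketch of the same idea follows; the bucket path is illustrative, and automatic .gz decompression is my assumption based on the file extension.

```python
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    named_lines = (
        p
        | 'Match *.gz' >> fileio.MatchFiles('gs://my-bucket/logs/*.gz')  # illustrative path
        | 'Read matches' >> fileio.ReadMatches()
        # Each element is a ReadableFile, so the original path travels with
        # the contents. (Assumption: the .gz extension is handled by the
        # automatic compression detection.)
        | 'Pair name and lines' >> beam.FlatMap(
            lambda f: [(f.metadata.path, line)
                       for line in f.read_utf8().splitlines()]))
```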

Google Cloud Dataflow: Read from a file with a dynamic filename

我怕爱的太早我们不能终老 submitted on 2019-12-07 12:02:31
I am trying to build a pipeline on Google Cloud Dataflow that would do the following: listen to events on a Pub/Sub subscription, extract the filename from the event text, read the file (from a Google Cloud Storage bucket), and store the records in BigQuery. Following is the code: Pipeline pipeline = //create pipeline pipeline.apply("read events", PubsubIO.readStrings().fromSubscription("sub")) .apply("Deserialise events", //Code that produces ParDo.SingleOutput<String, KV<String, byte[]>>) .apply(TextIO.read().from(""))??? I am struggling with the 3rd step, not quite sure how to access the output of the second step
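The third step fails because TextIO.read().from() wants the filename when the pipeline is built, not one arriving per element. In the Java SDK this is normally solved with TextIO.readAll() or FileIO.readMatches() applied to the PCollection of filenames; the Python-SDK equivalent of that pattern is sketched below, with an illustrative subscription, message format, and BigQuery schema.

```python
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText


def run(pipeline_options):
    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | 'Read events' >> beam.io.ReadFromPubSub(
             subscription='projects/my-project/subscriptions/sub')    # illustrative
         # Illustrative: assume the message payload is the GCS path itself.
         | 'To filename' >> beam.Map(lambda msg: msg.decode('utf-8').strip())
         | 'Read files' >> ReadAllFromText()
         | 'To rows' >> beam.Map(lambda line: {'record': line})        # illustrative schema
         | 'To BigQuery' >> beam.io.WriteToBigQuery('my-project:my_dataset.my_table'))
```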

Streaming MutationGroups into Spanner

一笑奈何 submitted on 2019-12-07 10:05:49
Question: I'm trying to stream MutationGroups into Spanner with SpannerIO. The goal is to write new MutationGroups every 10 seconds, as we will use Spanner to query near-time KPIs. When I don't use any windows, I get the following error: Exception in thread "main" java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey. at org.apache.beam.sdk.transforms.GroupByKey
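The exception is the clue: whatever grouping step raises it (here inside the Spanner write) needs a window or trigger on the unbounded input. The pipeline in the question is Java; conceptually, the fix is to put the stream into 10-second fixed windows before the sink, which in the Python SDK would look roughly like this (the input PCollection is a placeholder):

```python
import apache_beam as beam
from apache_beam import window


def window_for_streaming_writes(mutation_groups):
    """mutation_groups: an unbounded PCollection feeding the Spanner sink.

    Ten-second fixed windows give the downstream GroupByKey a point at
    which to fire, which is what the IllegalStateException asks for.
    """
    return mutation_groups | 'FixedWindows 10s' >> beam.WindowInto(
        window.FixedWindows(10))
```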

Dataflow pipeline and pubsub emulator

无人久伴 submitted on 2019-12-07 08:02:37
Question: I'm trying to set up my development environment. Instead of using Google Cloud Pub/Sub in production, I've been using the Pub/Sub emulator for development and testing. To achieve this I set the following environment variable: export PUBSUB_EMULATOR_HOST=localhost:8586 This worked for the Python Google Pub/Sub library, but when I switched to using Java Apache Beam for Google Dataflow, the pipeline still points to production Google Pub/Sub. Is there a setting, environment variable, or method on the
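For reference, a sketch of the environment-variable route that the question says already works with the Python client library; the host, project, and topic names are illustrative. (On the Java Beam side, the endpoint is usually redirected through a Pub/Sub-related pipeline option rather than this environment variable, which is presumably why the variable alone is not picked up.)

```python
import os
from google.cloud import pubsub_v1

# With this variable set, the client library talks to the emulator and
# skips real credentials. Host/port, project, and topic are illustrative.
os.environ['PUBSUB_EMULATOR_HOST'] = 'localhost:8586'

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-local-project', 'my-topic')
publisher.create_topic(name=topic_path)
publisher.publish(topic_path, b'hello from the emulator')
```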

Custom Apache Beam Python version in Dataflow

橙三吉。 submitted on 2019-12-07 07:41:35
Question: I am wondering if it is possible to have a custom Apache Beam Python version running in Google Dataflow, i.e. a version that is not available in the public repositories (as of this writing: 0.6.0 and 2.0.0). For example, the HEAD version from the official repository of Apache Beam, or a specific tag for that matter. I am aware of the possibility of packaging custom packages (private local ones, for example) as described in the official documentation. There are answered questions here on how to do
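A hedged sketch of the route usually taken for this: build an sdist from the Beam checkout you want (HEAD or a tag) and point the job at it with the --sdk_location pipeline option, so the Dataflow workers install your tarball instead of a released version. Project, bucket, and tarball paths are illustrative.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Built beforehand, e.g. by running `python setup.py sdist` inside the
# sdks/python directory of your Apache Beam clone.
CUSTOM_SDK_TARBALL = 'dist/apache-beam-2.1.0.dev0.tar.gz'  # illustrative

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                    # illustrative
    '--temp_location=gs://my-bucket/tmp',      # illustrative
    '--sdk_location=' + CUSTOM_SDK_TARBALL,
])

with beam.Pipeline(options=options) as p:
    p | 'Smoke' >> beam.Create(['smoke test']) | 'Print' >> beam.Map(print)
```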