apache-beam

How can I implement zipWithIndex like Spark in Apache Beam?

◇◆丶佛笑我妖孽 submitted on 2019-12-14 03:11:50
Question: PCollection<String> p1 = {"a","b","c"}; PCollection<KV<Integer,String>> p2 = p1.apply("some operation") // {(1,"a"),(2,"b"),(3,"c")}. I need this to scale to large files, the way Apache Spark does with sc.textFile("./filename").zipWithIndex. My goal is to preserve the order between rows within a large file by assigning row numbers in a scalable way. How can I get this result with Apache Beam? Some related posts: zipWithIndex on Apache Flink; Ranking pcollection elements. Answer 1: There
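The answer above is cut off; for illustration only, here is a minimal sketch of the naive approach (assuming p1 is the PCollection<String> from the question). It is not the scalable solution the question asks for: PCollections are unordered, so this funnels everything through a single key and a single worker, and the resulting order is not guaranteed to match the file order.

    // Naive numbering sketch: group all elements under one constant key, then
    // number them inside a single DoFn. Not scalable, and order is not guaranteed.
    PCollection<KV<Integer, String>> numbered = p1
        .apply("KeyByConstant", ParDo.of(new DoFn<String, KV<String, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(KV.of("all", c.element()));
          }
        }))
        .apply("GroupAll", GroupByKey.<String, String>create())
        .apply("AssignIndexes", ParDo.of(new DoFn<KV<String, Iterable<String>>, KV<Integer, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            int index = 1;
            for (String element : c.element().getValue()) {
              c.output(KV.of(index++, element));
            }
          }
        }));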

Reading from Pubsub using Dataflow Java SDK 2

放肆的年华 submitted on 2019-12-14 02:17:57
Question: A lot of the documentation for the Google Cloud Platform Java SDK 2.x tells you to reference the Beam documentation. When reading from Pub/Sub using Dataflow, should I still be doing PubsubIO.Read.named("name").topic(""); or should I be doing something else? Also, building off of that, is there a way to just print the Pub/Sub data received by Dataflow to standard output or to a file? Answer 1: For Apache Beam 2.2.0, you can define the following transform to pull messages from a Pub/Sub subscription:
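The answer's code is cut off above; a hedged sketch of what a Beam 2.x subscription read plus a debug print might look like (the subscription path and step names are placeholders):

    // Pull raw message payloads as strings from an existing subscription.
    PCollection<String> messages = pipeline.apply("ReadFromPubsub",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"));

    // Debug only: on Dataflow, System.out goes to the worker logs, not to the
    // console of the machine that launched the job; use TextIO.write() for files.
    messages.apply("PrintPayload", ParDo.of(new DoFn<String, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        System.out.println(c.element());
      }
    }));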

Apache Beam number of times a pane is fired with early triggers

我的未来我决定 submitted on 2019-12-13 17:43:11
Question: In a streaming Beam pipeline, the trigger is set to Window.into(FixedWindows.of(Duration.standardHours(1))).triggering(AfterWatermark.pastEndOfWindow().withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(15)))).withAllowedLateness(Duration.standardHours(1)).accumulatingFiredPanes(). If there's no new data between the early firing (15 minutes after the first element of the current window) and the watermark, will there be another firing at
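For reference, the same configuration written out as a complete windowing transform on a hypothetical PCollection<KV<String, Long>> named input, with comments on what each builder call configures (a sketch, not the original poster's code):

    PCollection<KV<String, Long>> windowed = input.apply("WindowHourly",
        Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardHours(1)))
            // The on-time pane fires when the watermark passes the end of the window.
            .triggering(AfterWatermark.pastEndOfWindow()
                // Speculative (early) panes fire 15 minutes of processing time after
                // the first element arrives in the current pane.
                .withEarlyFirings(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(15))))
            // Data up to 1 hour behind the watermark is still assigned to the window.
            .withAllowedLateness(Duration.standardHours(1))
            // Each new pane repeats the elements already emitted in earlier panes.
            .accumulatingFiredPanes());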

Sample in Dataflow / Beam with Python

戏子无情 submitted on 2019-12-13 17:30:23
Question: I'm trying to get a sample of the items in a PCollection using the Python SDK on Dataflow / Beam. While it's not documented, Sample.FixedSizeGlobally(n) exists. When testing, it seems to return a PCollection with a single item: a list containing the samples, rather than a PCollection of the samples. Is that correct? Is the following the best way of turning that single-item PCollection into a PCollection of the items? | Sample.FixedSizeGlobally(sample_size) | beam.FlatMap(lambda x: x) Answer 1:

Apache beam: No Runner was specified and the DirectRunner was not found on the classpath

守給你的承諾、 submitted on 2019-12-13 16:07:54
Question: I am building a Gradle Java project (please refer below) that uses Apache Beam code, and I am executing it in Eclipse Oxygen. package com.xxxx.beam; import java.io.IOException; import org.apache.beam.runners.spark.SparkContextOptions; import org.apache.beam.runners.spark.SparkPipelineResult; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineRunner; import org.apache.beam.sdk.options.PipelineOptions; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk
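The error in the title usually means either that no runner was configured or that the org.apache.beam:beam-runners-direct-java artifact is missing from the runtime classpath (in Gradle, a runtime/runtimeOnly dependency). A hedged sketch of selecting the runner explicitly, assuming that dependency is present:

    // Requires org.apache.beam:beam-runners-direct-java on the runtime classpath.
    PipelineOptions options = PipelineOptionsFactory.create();
    options.setRunner(org.apache.beam.runners.direct.DirectRunner.class);
    // Alternatively: PipelineOptionsFactory.fromArgs(args).create() and pass
    // --runner=DirectRunner on the command line.
    Pipeline pipeline = Pipeline.create(options);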

How do I resolve a Pickling Error on class apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum?

血红的双手。 submitted on 2019-12-13 14:07:15
Question: A PicklingError is raised when I run my data pipeline remotely: the data pipeline has been written using the Beam SDK for Python, and I am running it on top of Google Cloud Dataflow. The pipeline works fine when I run it locally. The following code generates the PicklingError and ought to reproduce the problem: import apache_beam as beam from apache_beam.transforms import pvalue from apache_beam.io.fileio import _CompressionType from apache_beam.utils.options import PipelineOptions from

Need to insert rows into ClickHouse from Apache Beam (Dataflow)

徘徊边缘 submitted on 2019-12-13 12:51:23
Question: I am reading from a Pub/Sub topic, which is running fine; now I need to insert into a table on ClickHouse. I am learning, so please excuse the tardiness. PipelineOptions options = PipelineOptionsFactory.create(); //PubSubToDatabasesPipelineOptions options; Pipeline p = Pipeline.create(options); PCollection<String> inputFromPubSub = p.apply(namePrefix + "ReadFromPubSub", PubsubIO.readStrings().fromSubscription("projects/*********/subscriptions/crypto_bitcoin.dataflow.bigquery.transactions")
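Newer Beam releases also ship a ClickHouseIO connector (beam-sdks-java-io-clickhouse); as one hedged alternative, the parsed elements can be written through JdbcIO with the ClickHouse JDBC driver. In the sketch below, Transaction and ParseJsonFn are hypothetical stand-ins for the poster's element type and parsing step, and the driver class, URL and SQL statement are placeholders:

    inputFromPubSub
        .apply("ParseMessage", ParDo.of(new ParseJsonFn()))            // hypothetical parser DoFn
        .apply("WriteToClickHouse", JdbcIO.<Transaction>write()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "ru.yandex.clickhouse.ClickHouseDriver",               // ClickHouse JDBC driver class
                "jdbc:clickhouse://clickhouse-host:8123/default"))     // placeholder URL
            .withStatement("INSERT INTO transactions (tx_id, amount) VALUES (?, ?)")
            .withPreparedStatementSetter((Transaction t, java.sql.PreparedStatement ps) -> {
              ps.setString(1, t.getId());
              ps.setDouble(2, t.getAmount());
            }));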

Streaming pipelines with BigQuery sinks in python

寵の児 submitted on 2019-12-13 12:21:04
Question: I'm building an Apache Beam streaming pipeline whose source is Pub/Sub and whose sink is BigQuery. I've gotten the error message: "Workflow failed. Causes: Unknown message code." As cryptic as this message is, I now believe that BigQuery is not supported as a sink for streaming pipelines; it says this here: Streaming from Pub/Sub to BigQuery. Am I correct that this is what's causing the problem? Or, if not, is it still unsupported in any case? Can anyone hint at when this

Java Apache Beam - save file "LOCALLY" by using DataflowRunner

邮差的信 submitted on 2019-12-13 05:23:52
Question: I can send the Java code, but currently it's not necessary. I have an issue: when I run the job with the DirectRunner (using a Google VM instance) it works fine, as it saves the information to the local file and carries on... The problem appears when trying to use the DataflowRunner, and the error I receive is: java.nio.file.NoSuchFileException: XXXX.csv ..... ..... XXXX.csv could not be deleted. It could not be deleted because it was not even created. Problem - how do I write the file locally when running
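When the job runs with the DataflowRunner, the pipeline code executes on Dataflow worker VMs in Google Cloud, so a "local" path refers to the workers' disks, not to the VM that launched the job; that is why the file is never created where it is expected. A hedged sketch of the usual workaround, writing to Cloud Storage instead, assuming output is the PCollection<String> to be saved (bucket and prefix are placeholders):

    output.apply("WriteCsv", TextIO.write()
        .to("gs://my-bucket/output/results")   // placeholder GCS bucket/prefix
        .withSuffix(".csv")
        .withoutSharding());                   // single output file; only for small outputs

If the file truly must end up on the launching VM, it can be copied back from GCS (e.g. with gsutil) after the pipeline finishes.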

Apache Beam - skip pipeline step

◇◆丶佛笑我妖孽 submitted on 2019-12-13 04:39:21
Question: I'm using Apache Beam to set up a pipeline consisting of 2 main steps: transform the data using a Beam transform, then load the transformed data into BigQuery. The pipeline setup looks like this: myPCollection = (org.apache.beam.sdk.values.PCollection<myCollectionObjectType>) myInputPCollection.apply("do a parallel transform", ParDo.of(new MyTransformClassName.MyTransformFn())); myPCollection.apply("Load BigQuery data for PCollection", BigQueryIO.<myCollectionObjectType>write().to(new
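If the intent is to optionally skip the transform step, one hedged option is to branch while the pipeline graph is being built, since the graph is constructed in ordinary Java before execution. The flag below is hypothetical (it could come from a custom PipelineOptions getter), and the type names mirror the question; note this only type-checks when the input collection already has the element type the BigQuery write expects:

    boolean skipTransform = false; // e.g. read from a custom pipeline option

    // A plain Java conditional decides whether the transform is added to the graph.
    PCollection<myCollectionObjectType> myPCollection = skipTransform
        ? myInputPCollection
        : myInputPCollection.apply("do a parallel transform",
              ParDo.of(new MyTransformClassName.MyTransformFn()));

    // The "Load BigQuery data for PCollection" apply(...) from the question then
    // consumes myPCollection unchanged.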