apache-beam

How can I implement zipWithIndex like Spark in Apache Beam?

◇◆丶佛笑我妖孽 submitted on 2019-12-14 03:11:50
Question: PCollection<String> p1 = {"a","b","c"}; PCollection<KV<Integer,String>> p2 = p1.apply("some operation") // {(1,"a"),(2,"b"),(3,"c")}. I need this to scale to large files, the way Apache Spark does with sc.textFile("./filename").zipWithIndex. My goal is to preserve the order between rows within a large file by assigning row numbers in a scalable way. How can I get this result with Apache Beam? Some related posts: zipWithIndex on Apache Flink; Ranking pcollection elements. Answer 1: There
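The answer above is cut off; for illustration only, here is a minimal sketch of the naive approach (assuming p1 is the PCollection<String> from the question). It is not the scalable solution the question asks for: PCollections are unordered, so this funnels everything through a single key and a single worker, and the resulting order is not guaranteed to match the file order.

    // Naive numbering sketch: group all elements under one constant key, then
    // number them inside a single DoFn. Not scalable, and order is not guaranteed.
    PCollection<KV<Integer, String>> numbered = p1
        .apply("KeyByConstant", ParDo.of(new DoFn<String, KV<String, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(KV.of("all", c.element()));
          }
        }))
        .apply("GroupAll", GroupByKey.<String, String>create())
        .apply("AssignIndexes", ParDo.of(new DoFn<KV<String, Iterable<String>>, KV<Integer, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            int index = 1;
            for (String element : c.element().getValue()) {
              c.output(KV.of(index++, element));
            }
          }
        }));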

Reading from Pubsub using Dataflow Java SDK 2

放肆的年华 submitted on 2019-12-14 02:17:57
Question: A lot of the documentation for the Google Cloud Platform Java SDK 2.x tells you to reference the Beam documentation. When reading from Pub/Sub using Dataflow, should I still be doing PubsubIO.Read.named("name").topic(""); or should I be doing something else? Also, building off of that, is there a way to just print the Pub/Sub data received by Dataflow to standard output or to a file? Answer 1: For Apache Beam 2.2.0, you can define the following transform to pull messages from a Pub/Sub subscription:
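The answer's code is cut off above; a hedged sketch of what a Beam 2.x subscription read plus a debug print might look like (the subscription path and step names are placeholders):

    // Pull raw message payloads as strings from an existing subscription.
    PCollection<String> messages = pipeline.apply("ReadFromPubsub",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"));

    // Debug only: on Dataflow, System.out goes to the worker logs, not to the
    // console of the machine that launched the job; use TextIO.write() for files.
    messages.apply("PrintPayload", ParDo.of(new DoFn<String, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        System.out.println(c.element());
      }
    }));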

Apache Beam number of times a pane is fired with early triggers

我的未来我决定 submitted on 2019-12-13 17:43:11
Question: In a streaming Beam pipeline, the trigger is set to Window.into(FixedWindows.of(Duration.standardHours(1))).triggering(AfterWatermark.pastEndOfWindow().withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(15)))).withAllowedLateness(Duration.standardHours(1)).accumulatingFiredPanes(). If there's no new data between the early firing (15 minutes after the first element of the current window) and the watermark, will there be another firing at
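For reference, the same configuration written out as a complete windowing transform on a hypothetical PCollection<KV<String, Long>> named input, with comments on what each builder call configures (a sketch, not the original poster's code):

    PCollection<KV<String, Long>> windowed = input.apply("WindowHourly",
        Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardHours(1)))
            // The on-time pane fires when the watermark passes the end of the window.
            .triggering(AfterWatermark.pastEndOfWindow()
                // Speculative (early) panes fire 15 minutes of processing time after
                // the first element arrives in the current pane.
                .withEarlyFirings(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(15))))
            // Data up to 1 hour behind the watermark is still assigned to the window.
            .withAllowedLateness(Duration.standardHours(1))
            // Each new pane repeats the elements already emitted in earlier panes.
            .accumulatingFiredPanes());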

Sample in Dataflow / Beam with Python

戏子无情 submitted on 2019-12-13 17:30:23
Question: I'm trying to get a sample of the items in a PCollection using the Python SDK on Dataflow / Beam. While it's not documented, Sample.FixedSizeGlobally(n) exists. When testing, it seems to return a PCollection with a single item: a list containing the samples, rather than a PCollection of the samples. Is that correct? Is the following the best way of turning that single-item PCollection into a PCollection of the items? | Sample.FixedSizeGlobally(sample_size) | beam.FlatMap(lambda x: x) Answer 1:

Apache beam: No Runner was specified and the DirectRunner was not found on the classpath

守給你的承諾、 submitted on 2019-12-13 16:07:54
Question: I am building a Gradle Java project (please refer below) that uses Apache Beam code, and I am executing it in Eclipse Oxygen. package com.xxxx.beam; import java.io.IOException; import org.apache.beam.runners.spark.SparkContextOptions; import org.apache.beam.runners.spark.SparkPipelineResult; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineRunner; import org.apache.beam.sdk.options.PipelineOptions; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk
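The error in the title usually means either that no runner was configured or that the org.apache.beam:beam-runners-direct-java artifact is missing from the runtime classpath (in Gradle, a runtime/runtimeOnly dependency). A hedged sketch of selecting the runner explicitly, assuming that dependency is present:

    // Requires org.apache.beam:beam-runners-direct-java on the runtime classpath.
    PipelineOptions options = PipelineOptionsFactory.create();
    options.setRunner(org.apache.beam.runners.direct.DirectRunner.class);
    // Alternatively: PipelineOptionsFactory.fromArgs(args).create() and pass
    // --runner=DirectRunner on the command line.
    Pipeline pipeline = Pipeline.create(options);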

How do I resolve a Pickling Error on class apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum?

血红的双手。 submitted on 2019-12-13 14:07:15
Question: A PicklingError is raised when I run my data pipeline remotely: the data pipeline has been written using the Beam SDK for Python, and I am running it on top of Google Cloud Dataflow. The pipeline works fine when I run it locally. The following code generates the PicklingError and ought to reproduce the problem: import apache_beam as beam from apache_beam.transforms import pvalue from apache_beam.io.fileio import _CompressionType from apache_beam.utils.options import PipelineOptions from

Need to insert rows into ClickHouse from Apache Beam (Dataflow)

徘徊边缘 submitted on 2019-12-13 12:51:23
Question: I am reading from a Pub/Sub topic, which is running fine; now I need to insert into a table on ClickHouse. I am learning, so please excuse the tardiness. PipelineOptions options = PipelineOptionsFactory.create(); //PubSubToDatabasesPipelineOptions options; Pipeline p = Pipeline.create(options); PCollection<String> inputFromPubSub = p.apply(namePrefix + "ReadFromPubSub", PubsubIO.readStrings().fromSubscription("projects/*********/subscriptions/crypto_bitcoin.dataflow.bigquery.transactions")
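Newer Beam releases also ship a ClickHouseIO connector (beam-sdks-java-io-clickhouse); as one hedged alternative, the parsed elements can be written through JdbcIO with the ClickHouse JDBC driver. In the sketch below, Transaction and ParseJsonFn are hypothetical stand-ins for the poster's element type and parsing step, and the driver class, URL and SQL statement are placeholders:

    inputFromPubSub
        .apply("ParseMessage", ParDo.of(new ParseJsonFn()))            // hypothetical parser DoFn
        .apply("WriteToClickHouse", JdbcIO.<Transaction>write()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "ru.yandex.clickhouse.ClickHouseDriver",               // ClickHouse JDBC driver class
                "jdbc:clickhouse://clickhouse-host:8123/default"))     // placeholder URL
            .withStatement("INSERT INTO transactions (tx_id, amount) VALUES (?, ?)")
            .withPreparedStatementSetter((Transaction t, java.sql.PreparedStatement ps) -> {
              ps.setString(1, t.getId());
              ps.setDouble(2, t.getAmount());
            }));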

Streaming pipelines with BigQuery sinks in python

寵の児 submitted on 2019-12-13 12:21:04
Question: I'm building an Apache Beam streaming pipeline whose source is Pub/Sub and whose sink is BigQuery. I've gotten the error message: "Workflow failed. Causes: Unknown message code." As cryptic as this message is, I now believe that BigQuery is not supported as a sink for streaming pipelines; it says this here: Streaming from Pub/Sub to BigQuery. Am I correct that this is what's causing the problem? Or, if not, is it still unsupported in any case? Can anyone hint at when this

Java Apache Beam - save file "LOCALLY" by using DataflowRunner

邮差的信 submitted on 2019-12-13 05:23:52
Question: I can send the Java code, but currently it's not necessary. I have an issue: when I run the job with the DirectRunner (using a Google VM instance) it works fine, as it saves the information to the local file and carries on... The problem appears when trying to use the DataflowRunner, and the error I receive is: java.nio.file.NoSuchFileException: XXXX.csv ..... ..... XXXX.csv could not be deleted. It could not be deleted because it was not even created. Problem - how do I write the file locally when running
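When the job runs with the DataflowRunner, the pipeline code executes on Dataflow worker VMs in Google Cloud, so a "local" path refers to the workers' disks, not to the VM that launched the job; that is why the file is never created where it is expected. A hedged sketch of the usual workaround, writing to Cloud Storage instead, assuming output is the PCollection<String> to be saved (bucket and prefix are placeholders):

    output.apply("WriteCsv", TextIO.write()
        .to("gs://my-bucket/output/results")   // placeholder GCS bucket/prefix
        .withSuffix(".csv")
        .withoutSharding());                   // single output file; only for small outputs

If the file truly must end up on the launching VM, it can be copied back from GCS (e.g. with gsutil) after the pipeline finishes.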

Apache Beam - skip pipeline step

◇◆丶佛笑我妖孽 submitted on 2019-12-13 04:39:21
Question: I'm using Apache Beam to set up a pipeline consisting of 2 main steps: transform the data using a Beam transform, then load the transformed data into BigQuery. The pipeline setup looks like this: myPCollection = (org.apache.beam.sdk.values.PCollection<myCollectionObjectType>) myInputPCollection.apply("do a parallel transform", ParDo.of(new MyTransformClassName.MyTransformFn())); myPCollection.apply("Load BigQuery data for PCollection", BigQueryIO.<myCollectionObjectType>write().to(new
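If the intent is to optionally skip the transform step, one hedged option is to branch while the pipeline graph is being built, since the graph is constructed in ordinary Java before execution. The flag below is hypothetical (it could come from a custom PipelineOptions getter), and the type names mirror the question; note this only type-checks when the input collection already has the element type the BigQuery write expects:

    boolean skipTransform = false; // e.g. read from a custom pipeline option

    // A plain Java conditional decides whether the transform is added to the graph.
    PCollection<myCollectionObjectType> myPCollection = skipTransform
        ? myInputPCollection
        : myInputPCollection.apply("do a parallel transform",
              ParDo.of(new MyTransformClassName.MyTransformFn()));

    // The "Load BigQuery data for PCollection" apply(...) from the question then
    // consumes myPCollection unchanged.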