apache-beam

How to use FileIO.writeDynamic() in Apache Beam 2.6 to write to multiple output paths?

Submitted by 本小妞迷上赌 on 2020-01-06 14:13:40
Question: I am using Apache Beam 2.6 to read from a single Kafka topic and write the output to Google Cloud Storage (GCS). Now I want to alter the pipeline so that it reads multiple topics and writes them out as gs://bucket/topic/... When reading only a single topic I used TextIO in the last step of my pipeline: TextIO.write() .to( new DateNamedFiles( String.format("gs://bucket/data%s/", suffix), currentMillisString)) .withWindowedWrites() .withTempDirectory( FileBasedSink
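
A minimal sketch of the FileIO.writeDynamic() approach, assuming the pipeline has already produced a windowed PCollection<KV<String, String>> of (topic, line) pairs; the bucket path, file naming, and shard count below are placeholders, not taken from the original post:

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Contextful;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    class DynamicWriteExample {
      /** Writes (topic, line) pairs under gs://bucket/<topic>/, one directory per topic. */
      static void writePerTopic(PCollection<KV<String, String>> lines) {
        lines.apply(
            FileIO.<String, KV<String, String>>writeDynamic()
                // Route each element to a destination derived from its topic name.
                .by(KV::getKey)
                // Write only the line text, using the plain text sink.
                .via(Contextful.fn(KV::getValue), TextIO.sink())
                .to("gs://bucket/")
                // Name files per destination: gs://bucket/<topic>/part-....txt
                .withNaming(topic -> FileIO.Write.defaultNaming(topic + "/part", ".txt"))
                .withDestinationCoder(StringUtf8Coder.of())
                // Explicit sharding is required for windowed/unbounded writes.
                .withNumShards(1));
      }
    }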

Discrepancy in running Apache Beam Data Generator with DirectRunner and FlinkRunner

Submitted by 大憨熊 on 2020-01-06 07:12:42
Question: This question is related to my earlier post about benchmarking Apache Beam with an on-the-fly data generator. I have the following code to generate data within my pipeline: PCollection<Long> data = pipeline.apply(GenerateSequence.from(1) .withMaxReadTime(Duration.millis(3000))); //Print generated data data.apply(ParDo.of(new DoFn<Long, String>() { @ProcessElement public void processElement(@Element Long input) { System.out.println(input); } })); pipeline.run(); If I run this code with
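
For reference, a self-contained version of the generator snippet above that can be launched with either runner for comparison (the class name and option handling are illustrative, not from the original post):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class GeneratorCheck {
      public static void main(String[] args) {
        // Pass --runner=DirectRunner or --runner=FlinkRunner on the command line.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        // Generate elements for roughly 3 seconds, as in the original snippet.
        PCollection<Long> data =
            pipeline.apply(GenerateSequence.from(1).withMaxReadTime(Duration.millis(3000)));

        // Print each generated element. On a Flink cluster, stdout from workers
        // typically ends up in the task manager logs rather than the client console.
        data.apply(ParDo.of(new DoFn<Long, String>() {
          @ProcessElement
          public void processElement(@Element Long input) {
            System.out.println(input);
          }
        }));

        pipeline.run().waitUntilFinish();
      }
    }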

Code logic not working as expected: mistake in my logic in Apache Beam on Google Cloud

Submitted by 大城市里の小女人 on 2020-01-06 04:37:09
Question: I am trying to implement CDC in apache_beam. Here, I have unloaded the master data and the new data, which is expected to come in daily. The join is not working as expected; something is amiss. Can anyone please help me rectify my mistake? Am I missing a step? master_data = ( p | 'Read base from BigQuery ' >> beam.io.Read( beam.io.BigQuerySource(query=master_data, use_standard_sql=True)) | 'Map id in master' >> beam.Map( lambda master: ( master['id'], master ))) new_data = ( p | 'Read

Apache Beam with Flink backend throws NoSuchMethodError on calls to protobuf-java library methods

Submitted by 纵饮孤独 on 2020-01-06 03:41:07
Question: I'm trying to run a simple pipeline on a local cluster, using Protocol Buffers to pass data between Beam functions. The com.google.protobuf:protobuf-java dependency is included in the fat JAR. Everything works fine if I run it with: java -jar target/dataflow-test-1.0-SNAPSHOT.jar \ --runner=org.apache.beam.runners.flink.FlinkRunner \ --input=/tmp/kinglear.txt --output=/tmp/wordcounts.txt But it fails when I try to run it on the Flink cluster: flink run target/dataflow-test-1.0-SNAPSHOT.jar \ --runner=org.apache

Google DataFlow: attaching filename to the message

Submitted by 五迷三道 on 2020-01-04 09:21:44
Question: I'm trying to build a Google DataFlow pipeline with these steps: read a message containing a filename from a Pub/Sub topic; find the file in the Google bucket using that filename; read each line from the file; send each line, together with the filename, as a single message to another topic. My problem is that I can't add the filename to the final output message. Current implementation: ConnectorOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(ConnectorOptions.class); Pipeline p = Pipeline
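
One way to keep the filename attached to each line is to read the file inside a DoFn and emit (filename, line) pairs, so a later step can include the filename in the message it publishes. A sketch under that assumption; the class name and the gs://my-bucket/ prefix are placeholders, not from the original post:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.channels.Channels;
    import java.nio.charset.StandardCharsets;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.io.fs.MatchResult;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    /** Reads the file named by each incoming Pub/Sub message and emits (filename, line) pairs. */
    class ReadFileWithName extends DoFn<String, KV<String, String>> {
      @ProcessElement
      public void processElement(@Element String fileName, OutputReceiver<KV<String, String>> out)
          throws IOException {
        // Locate the file in the bucket; "gs://my-bucket/" is a placeholder path.
        MatchResult.Metadata metadata =
            FileSystems.matchSingleFileSpec("gs://my-bucket/" + fileName);
        try (BufferedReader reader = new BufferedReader(Channels.newReader(
            FileSystems.open(metadata.resourceId()), StandardCharsets.UTF_8.name()))) {
          String line;
          while ((line = reader.readLine()) != null) {
            // The filename travels with every line it produced.
            out.output(KV.of(fileName, line));
          }
        }
      }
    }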

Does it make sense to use Google DataFlow/Apache Beam to parallelize image processing or crawling tasks?

Submitted by 半城伤御伤魂 on 2020-01-02 05:47:26
Question: I am considering Google DataFlow as an option for running a pipeline that involves steps like: downloading images from the web; processing images. I like that DataFlow manages the lifetime of the VMs required to complete the job, so I don't need to start or stop them myself, but all the examples I have come across use it for data-mining kinds of tasks. I wonder whether it is a viable option for other batch tasks like image processing and crawling. Answer 1: This use case is a possible application for Dataflow/Beam.
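
For a sense of how the downloading step could be expressed as an ordinary ParDo over a PCollection of URLs, a rough sketch (the class name, buffer size, and absence of retry or error handling are assumptions for illustration):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    /** Downloads each image URL and emits (url, bytes) for a downstream processing step. */
    class DownloadImageFn extends DoFn<String, KV<String, byte[]>> {
      @ProcessElement
      public void processElement(@Element String url, OutputReceiver<KV<String, byte[]>> out)
          throws IOException {
        try (InputStream in = new URL(url).openStream()) {
          ByteArrayOutputStream buffer = new ByteArrayOutputStream();
          byte[] chunk = new byte[8192];
          int bytesRead;
          while ((bytesRead = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, bytesRead);
          }
          out.output(KV.of(url, buffer.toByteArray()));
        }
      }
    }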

How to make the environment variables reach Dataflow workers as environment variables in python sdk

Submitted by 馋奶兔 on 2020-01-01 09:44:07
Question: I am writing a custom sink with the Python SDK and trying to store data in AWS S3. Connecting to S3 requires credentials (a secret key), but setting them in code is bad for security reasons. I would like the environment variables to reach the Dataflow workers as environment variables. How can I do that? Answer 1: Generally, for transmitting information to workers that you don't want to hard-code, you should use PipelineOptions - please see Creating Custom Options. Then, when constructing the pipeline, just
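
The answer points to Beam's custom pipeline options. The question concerns the Python SDK, but for illustration, a custom options interface in the Java SDK would look roughly like this (the option names are invented for this example):

    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;

    /** Hypothetical options carrying S3 credentials to the pipeline instead of hard-coding them. */
    public interface S3Options extends PipelineOptions {
      @Description("AWS access key id")
      String getAwsAccessKey();
      void setAwsAccessKey(String value);

      @Description("AWS secret access key")
      String getAwsSecretKey();
      void setAwsSecretKey(String value);
    }

The values can then be supplied at launch time (for example --awsAccessKey=$AWS_ACCESS_KEY_ID) and read from the options object inside the pipeline code; in the Python SDK the equivalent is a PipelineOptions subclass that registers its own arguments, as described in Creating Custom Options.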

How to write results to JSON files in GCS in Dataflow/Beam

Submitted by 走远了吗. on 2020-01-01 03:42:06
Question: I'm using the Python Beam SDK 0.6.0, and I would like to write my output to JSON files in Google Cloud Storage. What is the best way to do this? I guess I can use WriteToText from the text IO sink, but then I have to format each row separately, right? How do I save my result into valid JSON files that contain lists of objects? Answer 1: OK, for reference, I solved this by writing my own sink, built on the _TextSink used by WriteToText in the Beam SDK. Not sure if this is the best way to do it but it