apache-beam

How to use FileIO.writeDynamic() in Apache Beam 2.6 to write to multiple output paths?

Submitted by 本小妞迷上赌 on 2020-01-06 14:13:40
Question: I am using Apache Beam 2.6 to read from a single Kafka topic and write the output to Google Cloud Storage (GCS). Now I want to alter the pipeline so that it reads multiple topics and writes them out as gs://bucket/topic/... When reading only a single topic I used TextIO in the last step of my pipeline: TextIO.write() .to( new DateNamedFiles( String.format("gs://bucket/data%s/", suffix), currentMillisString)) .withWindowedWrites() .withTempDirectory( FileBasedSink
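
A minimal sketch of the FileIO.writeDynamic() approach, assuming the pipeline has already produced a windowed PCollection<KV<String, String>> of (topic, line) pairs; the bucket path, file naming, and shard count below are placeholders, not taken from the original post:

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Contextful;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    class DynamicWriteExample {
      /** Writes (topic, line) pairs under gs://bucket/<topic>/, one directory per topic. */
      static void writePerTopic(PCollection<KV<String, String>> lines) {
        lines.apply(
            FileIO.<String, KV<String, String>>writeDynamic()
                // Route each element to a destination derived from its topic name.
                .by(KV::getKey)
                // Write only the line text, using the plain text sink.
                .via(Contextful.fn(KV::getValue), TextIO.sink())
                .to("gs://bucket/")
                // Name files per destination: gs://bucket/<topic>/part-....txt
                .withNaming(topic -> FileIO.Write.defaultNaming(topic + "/part", ".txt"))
                .withDestinationCoder(StringUtf8Coder.of())
                // Explicit sharding is required for windowed/unbounded writes.
                .withNumShards(1));
      }
    }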

Discrepancy in running Apache Beam Data Generator with DirectRunner and FlinkRunner

Submitted by 大憨熊 on 2020-01-06 07:12:42
Question: This question is related to my earlier post about benchmarking Apache Beam with an on-the-fly data generator. I have the following code to generate data within my pipeline: PCollection<Long> data = pipeline.apply(GenerateSequence.from(1) .withMaxReadTime(Duration.millis(3000))); //Print generated data data.apply(ParDo.of(new DoFn<Long, String>() { @ProcessElement public void processElement(@Element Long input) { System.out.println(input); } })); pipeline.run(); If I run this code with
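
For reference, a self-contained version of the generator snippet above that can be launched with either runner for comparison (the class name and option handling are illustrative, not from the original post):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class GeneratorCheck {
      public static void main(String[] args) {
        // Pass --runner=DirectRunner or --runner=FlinkRunner on the command line.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        // Generate elements for roughly 3 seconds, as in the original snippet.
        PCollection<Long> data =
            pipeline.apply(GenerateSequence.from(1).withMaxReadTime(Duration.millis(3000)));

        // Print each generated element. On a Flink cluster, stdout from workers
        // typically ends up in the task manager logs rather than the client console.
        data.apply(ParDo.of(new DoFn<Long, String>() {
          @ProcessElement
          public void processElement(@Element Long input) {
            System.out.println(input);
          }
        }));

        pipeline.run().waitUntilFinish();
      }
    }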

Code logic not working as expected: mistake in my logic in Apache Beam on Google Cloud

Submitted by 大城市里の小女人 on 2020-01-06 04:37:09
Question: I am trying to implement CDC in apache_beam. Here, I have unloaded the master data and the new data, which is expected to come in daily. The join is not working as expected; something is amiss. Can anyone please help me rectify my mistake? Am I missing a step? master_data = ( p | 'Read base from BigQuery ' >> beam.io.Read( beam.io.BigQuerySource(query=master_data, use_standard_sql=True)) | 'Map id in master' >> beam.Map( lambda master: ( master['id'], master ))) new_data = ( p | 'Read

Apache Beam with Flink backend throws NoSuchMethodError on calls to protobuf-java library methods

Submitted by 纵饮孤独 on 2020-01-06 03:41:07
Question: I'm trying to run a simple pipeline on a local cluster, using Protocol Buffers to pass data between Beam functions. The com.google.protobuf:protobuf-java dependency is included in the fat JAR. Everything works fine if I run it with: java -jar target/dataflow-test-1.0-SNAPSHOT.jar \ --runner=org.apache.beam.runners.flink.FlinkRunner \ --input=/tmp/kinglear.txt --output=/tmp/wordcounts.txt But it fails when I try to run it on the Flink cluster: flink run target/dataflow-test-1.0-SNAPSHOT.jar \ --runner=org.apache

Google DataFlow: attaching filename to the message

Submitted by 五迷三道 on 2020-01-04 09:21:44
Question: I'm trying to build a Google DataFlow pipeline with these steps: read a message containing a filename from a Pub/Sub topic; find the file in the Google bucket using that filename; read each line from the file; send each line, together with the filename, as a single message to another topic. My problem is that I can't add the filename to the final output message. Current implementation: ConnectorOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(ConnectorOptions.class); Pipeline p = Pipeline
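
One way to keep the filename attached to each line is to read the file inside a DoFn and emit (filename, line) pairs, so a later step can include the filename in the message it publishes. A sketch under that assumption; the class name and the gs://my-bucket/ prefix are placeholders, not from the original post:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.channels.Channels;
    import java.nio.charset.StandardCharsets;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.io.fs.MatchResult;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    /** Reads the file named by each incoming Pub/Sub message and emits (filename, line) pairs. */
    class ReadFileWithName extends DoFn<String, KV<String, String>> {
      @ProcessElement
      public void processElement(@Element String fileName, OutputReceiver<KV<String, String>> out)
          throws IOException {
        // Locate the file in the bucket; "gs://my-bucket/" is a placeholder path.
        MatchResult.Metadata metadata =
            FileSystems.matchSingleFileSpec("gs://my-bucket/" + fileName);
        try (BufferedReader reader = new BufferedReader(Channels.newReader(
            FileSystems.open(metadata.resourceId()), StandardCharsets.UTF_8.name()))) {
          String line;
          while ((line = reader.readLine()) != null) {
            // The filename travels with every line it produced.
            out.output(KV.of(fileName, line));
          }
        }
      }
    }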

Does it make sense to use Google DataFlow/Apache Beam to parallelize image processing or crawling tasks?

Submitted by 半城伤御伤魂 on 2020-01-02 05:47:26
Question: I am considering Google DataFlow as an option for running a pipeline that involves steps like: downloading images from the web; processing images. I like that DataFlow manages the lifetime of the VMs required to complete the job, so I don't need to start or stop them myself, but all the examples I have come across use it for data-mining kinds of tasks. I wonder whether it is a viable option for other batch tasks like image processing and crawling. Answer 1: This use case is a possible application for Dataflow/Beam.
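
For a sense of how the downloading step could be expressed as an ordinary ParDo over a PCollection of URLs, a rough sketch (the class name, buffer size, and absence of retry or error handling are assumptions for illustration):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    /** Downloads each image URL and emits (url, bytes) for a downstream processing step. */
    class DownloadImageFn extends DoFn<String, KV<String, byte[]>> {
      @ProcessElement
      public void processElement(@Element String url, OutputReceiver<KV<String, byte[]>> out)
          throws IOException {
        try (InputStream in = new URL(url).openStream()) {
          ByteArrayOutputStream buffer = new ByteArrayOutputStream();
          byte[] chunk = new byte[8192];
          int bytesRead;
          while ((bytesRead = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, bytesRead);
          }
          out.output(KV.of(url, buffer.toByteArray()));
        }
      }
    }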

How to make the environment variables reach Dataflow workers as environment variables in python sdk

Submitted by 馋奶兔 on 2020-01-01 09:44:07
Question: I am writing a custom sink with the Python SDK and trying to store data in AWS S3. Connecting to S3 requires credentials (a secret key), but setting them in code is bad for security reasons. I would like the environment variables to reach the Dataflow workers as environment variables. How can I do that? Answer 1: Generally, for transmitting information to workers that you don't want to hard-code, you should use PipelineOptions - please see Creating Custom Options. Then, when constructing the pipeline, just
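
The answer points to Beam's custom pipeline options. The question concerns the Python SDK, but for illustration, a custom options interface in the Java SDK would look roughly like this (the option names are invented for this example):

    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;

    /** Hypothetical options carrying S3 credentials to the pipeline instead of hard-coding them. */
    public interface S3Options extends PipelineOptions {
      @Description("AWS access key id")
      String getAwsAccessKey();
      void setAwsAccessKey(String value);

      @Description("AWS secret access key")
      String getAwsSecretKey();
      void setAwsSecretKey(String value);
    }

The values can then be supplied at launch time (for example --awsAccessKey=$AWS_ACCESS_KEY_ID) and read from the options object inside the pipeline code; in the Python SDK the equivalent is a PipelineOptions subclass that registers its own arguments, as described in Creating Custom Options.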

How to write results to JSON files in GCS in Dataflow/Beam

Submitted by 走远了吗. on 2020-01-01 03:42:06
Question: I'm using the Python Beam SDK 0.6.0, and I would like to write my output to JSON files in Google Cloud Storage. What is the best way to do this? I guess I can use WriteToText from the text IO sink, but then I have to format each row separately, right? How do I save my result into valid JSON files that contain lists of objects? Answer 1: OK, for reference, I solved this by writing my own sink, built on the _TextSink used by WriteToText in the Beam SDK. Not sure if this is the best way to do it but it