问题
How to write messages received from PubSub to a text file in GCS using TextIO in Apache Beam? Saw some methods like withWindowedWrites() and withFilenamePolicy() but couldn't find any example of it in the documentation.
回答1:
Here is an example provided you are using the Java SDK (BEAM 2.1.0).
PipelineOptions options = PipelineOptionsFactory.fromArgs(args)
.withValidation()
.as(PipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
pipeline.begin()
.apply("PubsubIO",PubsubIO.readStrings()
.withTimestampAttribute("timestamp")
.fromSubscription("projects/YOUR-PROJECT/subscriptions/YOUR-SUBSCRIPTION"))
.apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(30L))))
.apply(TextIO.write().to("gs://YOUR-BUCKET").withWindowedWrites());
You can see the defaults that the SDK uses for the file naming by exploring the "expand" method in TextIO.Write.expand(PCollection input). Specifically I'd take a look at DefaultFilenamePolicy.java
来源:https://stackoverflow.com/questions/46909579/write-streaming-data-to-gcs-using-apache-beam