问题
I have a streaming pipeline hooked up to pub/sub that publishes filenames of GCS files. From there I want to read each file and parse out the events on each line (the events are what I ultimately want to process).
Can I use TextIO? Can you use it in a streaming pipeline when the filename is defined during execution (as opposed to using TextIO as a source and the fileName(s) are known at construction). If not I'm thinking of doing something like the following:
Get the topic from pub/sub ParDo to read each file and get the lines Process the lines of the file...
Could I use the FileBasedReader or something similar in this case to read the files? The files aren't too big so I wouldn't need to parallelize the reading of a single file, but I would need to read a lot of files.
回答1:
You can use the TextIO.readAll()
transform, which has been recently added to Beam in #3443. For example:
PCollection<String> filenames = p.apply(PubsubIO.readStrings()...);
PCollection<String> lines = filenames.apply(TextIO.readAll());
This will read all lines in each file arriving over pubsub.
来源:https://stackoverflow.com/questions/32277968/read-files-from-a-pcollection-of-gcs-filenames-in-pipeline