Read files from a PCollection of GCS filenames in Pipeline?

Posted by 孤街醉人 on 2020-01-13 09:28:25

Question


I have a streaming pipeline hooked up to pub/sub that publishes filenames of GCS files. From there I want to read each file and parse out the events on each line (the events are what I ultimately want to process).

Can I use TextIO for this? Specifically, can it be used in a streaming pipeline when the filename is only determined during execution (as opposed to using TextIO as a source, where the filename(s) are known at construction time)? If not, I'm thinking of doing something like the following:

1. Get the topic from pub/sub
2. ParDo to read each file and get the lines
3. Process the lines of the file...

Could I use the FileBasedReader or something similar in this case to read the files? The files aren't too big so I wouldn't need to parallelize the reading of a single file, but I would need to read a lot of files.
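For illustration, the manual ParDo approach outlined above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name `ReadFileLinesFn` and the helper `readLines` are hypothetical, it assumes each element is a single file path (not a glob) and that the files are UTF-8 text small enough to read in one go, and it uses Beam's `FileSystems` utility rather than `FileBasedReader`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

/** Hypothetical DoFn: takes a filename and emits every line of that file. */
public class ReadFileLinesFn extends DoFn<String, String> {

  /** Reads all lines of one file (assumed small, UTF-8) into memory. */
  static List<String> readLines(String filename) throws IOException {
    // Treat the string as a single file spec, e.g. gs://bucket/path/file.txt.
    ResourceId resource = FileSystems.matchNewResource(filename, false /* isDirectory */);
    List<String> lines = new ArrayList<>();
    try (ReadableByteChannel channel = FileSystems.open(resource);
        BufferedReader reader =
            new BufferedReader(Channels.newReader(channel, StandardCharsets.UTF_8.name()))) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    }
    return lines;
  }

  @ProcessElement
  public void processElement(@Element String filename, OutputReceiver<String> out)
      throws IOException {
    // Emit one output element per line of the file.
    for (String line : readLines(filename)) {
      out.output(line);
    }
  }
}
```

It would be applied to the filenames with `filenames.apply(ParDo.of(new ReadFileLinesFn()))`. Each file is read by a single worker, which matches the case described here: many files, none of them large enough to need splitting.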


Answer 1:


You can use the TextIO.readAll() transform, which was recently added to Beam in #3443. For example:

PCollection<String> filenames = p.apply(PubsubIO.readStrings()...);
PCollection<String> lines = filenames.apply(TextIO.readAll());

This will read all lines in each file arriving over pubsub.
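In later Beam releases `TextIO.readAll()` is marked deprecated in favor of composing `FileIO` matching with `TextIO.readFiles()`. A minimal sketch of that equivalent chain is shown below; the class and method names (`ReadLinesFromFilenames`, `readLines`) are hypothetical and only there to make the example self-contained, and it assumes a Beam version that ships `FileIO`.

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

/** Hypothetical helper that wraps the FileIO-based equivalent of TextIO.readAll(). */
public class ReadLinesFromFilenames {

  /** Turns a collection of filenames (or globs) into a collection of their lines. */
  static PCollection<String> readLines(PCollection<String> filenames) {
    return filenames
        // Resolve each incoming filename or glob to file metadata.
        .apply(FileIO.matchAll())
        // Convert the matched metadata into readable file handles.
        .apply(FileIO.readMatches())
        // Read every file as text, emitting one element per line.
        .apply(TextIO.readFiles());
  }
}
```

With the `filenames` collection from the snippet above, this would be used as `PCollection<String> lines = ReadLinesFromFilenames.readLines(filenames);`.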



Source: https://stackoverflow.com/questions/32277968/read-files-from-a-pcollection-of-gcs-filenames-in-pipeline