TextIO. Read multiple files from GCS using pattern {}

佐手、 submitted on 2019-12-04 16:19:44

This may be another option, in addition to Scott's suggestion and your comment on his answer:

You can define a list with the paths you want to read and then iterate over it, creating a number of PCollections in the usual way:

PCollection<String> events1 = p.apply(TextIO.Read.from(path1));
PCollection<String> events2 = p.apply(TextIO.Read.from(path2));

Then create a PCollectionList:

PCollectionList<String> eventsList = PCollectionList.of(events1).and(events2);

And then flatten this list into your PCollection for your main input:

PCollection<String> events = eventsList.apply(Flatten.pCollections());
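If the paths follow a daily naming scheme, the list itself can be generated rather than typed out. The following is only a sketch using the standard library: the class name DailyPaths, the method between, and the bucket and file names are made up for illustration, and the one-file-per-day layout is an assumption. Each resulting string would then be passed to TextIO.Read.from (TextIO.read().from in Beam 2.x) in a loop, as described above.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the explicit list of paths to read.
class DailyPaths {

    // Returns one path per day in [start, end], both inclusive.
    // Assumes files are named <prefix><yyyy-MM-dd><suffix>.
    static List<String> between(String prefix, String suffix,
                                LocalDate start, LocalDate end) {
        List<String> paths = new ArrayList<>();
        for (LocalDate day = start; !day.isAfter(end); day = day.plusDays(1)) {
            // LocalDate.toString() yields the ISO-8601 form, e.g. 2017-06-01
            paths.add(prefix + day + suffix);
        }
        return paths;
    }

    public static void main(String[] args) {
        // Bucket and file prefix are placeholders, not real locations.
        List<String> paths = between("gs://my-bucket/events_", ".csv",
                LocalDate.of(2017, 6, 1), LocalDate.of(2017, 6, 3));
        paths.forEach(System.out::println);
    }
}
```

Listing the paths explicitly like this avoids picking up files you did not intend to read, at the cost of one read transform per file.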

Glob patterns work slightly differently in Google Cloud Storage vs. the local filesystem. Apache Beam's TextIO.Read transform will defer to the underlying filesystem to interpret the glob.

GCS glob wildcard patterns are documented in the Google Cloud Storage documentation (Wildcard Names).

In the case above, you could use:

TextIO.Read.from("gs://xyz.abc/xxx_2017-06-*.csv")

Note, however, that this will also pick up any other file whose name matches the pattern, that is, anything starting with xxx_2017-06- and ending with .csv.

Did you try the from function of Apache Beam's TextIO.Read? Its documentation says this works with GCS as well:

public TextIO.Read from(java.lang.String filepattern)

Reads text files from the file(s) with the given filename or filename pattern. This can be a local path (if running locally), or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>" (if running locally or using a remote execution service).

Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.
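To get a feel for what those three wildcards match, you can try them against plain file names with the JDK's own glob matcher. This is only a local illustration: the class name GlobDemo and the sample file names are invented, and it exercises java.nio's matcher on the default filesystem, not whatever matching is applied to gs:// paths, which may differ.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical demo class: checks file names against java.nio glob patterns.
class GlobDemo {

    // True if fileName matches the glob under the default filesystem's rules.
    static boolean matches(String glob, String fileName) {
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        // "*" matches any run of characters within a name, so it is broad.
        System.out.println(matches("xxx_2017-06-*.csv", "xxx_2017-06-01.csv"));        // true
        System.out.println(matches("xxx_2017-06-*.csv", "xxx_2017-06-01_backup.csv")); // true

        // "?" matches exactly one character.
        System.out.println(matches("xxx_2017-06-0?.csv", "xxx_2017-06-01.csv"));       // true
        System.out.println(matches("xxx_2017-06-0?.csv", "xxx_2017-06-10.csv"));       // false

        // "[..]" matches one character from the given set or range.
        System.out.println(matches("xxx_2017-06-0[1-3].csv", "xxx_2017-06-02.csv"));   // true
        System.out.println(matches("xxx_2017-06-0[1-3].csv", "xxx_2017-06-04.csv"));   // false
    }
}
```

The second line is the case to watch for: a trailing "*" happily matches extra files such as backups, so "?" or "[..]" is often the safer choice when the set of files is known.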
