Read Files from multiple folders in Apache Beam and map outputs to filenames

限于喜欢 提交于 2019-12-05 04:19:49

As you found, ReadFromText doesn't currently support dynamic filenames and you definitely don't want to create individual steps for the each URL. From your initial sentence I understand you want get the filename and the file content out as one item. That means you won't need or benefit from any streaming of parts of the file. You can simply read the file contents. Something like:

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


def read_all_from_url(url):
    with FileSystems.open(url) as f:
        return f.read()


def read_from_urls(pipeline, urls):
    return (
        pipeline
        | beam.Create(urls)
        | 'Read File' >> beam.Map(lambda url: (
            url,
            read_all_from_url(url)
        ))
    )

You can customise it if you think you're having issues with metadata. The output will be a tuple (url, file contents). If your file contents is very large you might need a slightly different approach depending on your use case.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!