Passing a Paramiko connection SFTPFile as input to a dask.dataframe.read_parquet

Submitted by 你说的曾经没有我的故事 on 2019-12-06 22:10:39

The situation has changed, and you can now do this directly with Dask. Answer pasted from Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?

In the master version of Dask, file-system operations now use fsspec, which, alongside the previous implementations (s3, gcs, hdfs), supports some additional file systems; see the mapping of protocol identifiers in fsspec.registry.known_implementations.
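
As a quick check, you can inspect that registry directly. This is a minimal sketch assuming a recent fsspec install:

import fsspec

# The registry maps each protocol identifier to its implementing class;
# the 'sftp' entry should point at fsspec.implementations.sftp.SFTPFileSystem.
print(fsspec.registry.known_implementations["sftp"])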

In short, using a URL like "sftp://user:pw@host:port/path" should now work for you, provided you install fsspec and Dask from master.
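
For example, a rough sketch of what that call might look like (the credentials, host, and path below are placeholders; the connection details embedded in the URL are parsed by fsspec and forwarded to Paramiko):

import dask.dataframe as dd

# Placeholder credentials, host, and path; the URL form mirrors the one above.
df = dd.read_parquet("sftp://user:pw@example-host:22/data/file.parquet",
                     engine="pyarrow")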

Dask does not support file-like objects directly.

You would have to implement their "file system" interface.

I'm not sure what the minimal set of methods is that you need to implement to allow read_parquet, but you definitely have to implement open. Something like this:

import dask.bytes.core
import dask.dataframe as dd

# Assumes `sftp_client` is an already-connected paramiko.SFTPClient,
# e.g. the one obtained via ssh.open_sftp() in the question.
class SftpFileSystem(object):
    def open(self, path, mode='rb', **kwargs):
        # Delegate opening the remote file to the Paramiko SFTP client.
        return sftp_client.open(path, mode)

# Register the file system for the 'sftp://' protocol (older Dask API).
dask.bytes.core._filesystems['sftp'] = SftpFileSystem

df = dd.read_parquet('sftp://remote/path/file', engine='pyarrow')

There's actually an implementation of such a file system for SFTP in the fsspec library:
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.sftp.SFTPFileSystem
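
As a hedged sketch of using it directly (host, credentials, and path are placeholders; the constructor forwards keyword arguments to Paramiko's SSHClient.connect):

from fsspec.implementations.sftp import SFTPFileSystem

# Placeholder connection details; keyword arguments go to SSHClient.connect.
fs = SFTPFileSystem("example-host", username="user", password="pw")

# Open a remote file through the same interface Dask uses internally.
with fs.open("/data/file.parquet", "rb") as f:
    magic = f.read(4)  # Parquet files begin with the magic bytes b"PAR1"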

See also Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?
