Passing a Paramiko connection SFTPFile as input to a dask.dataframe.read_parquet

Posted by 徘徊边缘 on 2019-12-08 08:10:38

Question


I tried passing a paramiko.sftp_file.SFTPFile object instead of a file URL to pandas.read_parquet and it worked fine. But when I tried the same with Dask, it threw an error. Below is the code I ran and the error I get. How can I make this work?

import dask.dataframe as dd
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(host, username=user, password=pw)  # connection details omitted here
sftp_client = ssh.open_sftp()
source_file = sftp_client.open(str(parquet_file), 'rb')
full_df = dd.read_parquet(source_file, engine='pyarrow')
print(len(full_df))
Traceback (most recent call last):
  File "C:\Users\rrrrr\Documents\jackets_dask.py", line 22, in <module>
    full_df = dd.read_parquet(source_file,engine='pyarrow')
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\dataframe\io\parquet.py", line 1173, in read_parquet
    storage_options=storage_options
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\bytes\core.py", line 368, in get_fs_token_paths
    raise TypeError('url type not understood: %s' % urlpath)
TypeError: url type not understood: <paramiko.sftp_file.SFTPFile object at 0x0000007712D9A208>

Answer 1:


The situation has changed, and you can now do this directly with Dask. Pasted answer from Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?

In the master version of Dask, file-system operations now go through fsspec, which, alongside the previous implementations (s3, gcs, hdfs), supports some additional file systems; see the mapping of protocol identifiers in fsspec.registry.known_implementations.

In short, a URL like "sftp://user:pw@host:port/path" should now work for you, if you install fsspec and Dask from master.
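
For example, a minimal sketch of that URL form (host, user, pw, and the remote path are placeholders; credentials can also go in storage_options, which fsspec forwards to paramiko's connect):

import dask.dataframe as dd

# Requires fsspec (with paramiko installed) so the 'sftp://' protocol resolves.
full_df = dd.read_parquet(
    'sftp://host/remote/path/file.parquet',  # hypothetical remote path
    engine='pyarrow',
    storage_options={'username': 'user', 'password': 'pw'},
)
print(len(full_df))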




Answer 2:


Dask does not support file-like objects directly.

You would have to implement Dask's "file system" interface.

I'm not sure what the minimal set of methods is that you need to implement for read_parquet to work, but you definitely have to implement open. Something like this:

import dask.bytes.core
import dask.dataframe as dd

class SftpFileSystem(object):
    def open(self, path, mode='rb', **kwargs):
        # assumes an already-connected paramiko SFTP client in scope
        return sftp_client.open(path, mode)

# register the class for the 'sftp://' protocol (internal Dask API)
dask.bytes.core._filesystems['sftp'] = SftpFileSystem

df = dd.read_parquet('sftp://remote/path/file', engine='pyarrow')

There's actually an implementation of such a file system for SFTP in the fsspec library:
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.sftp.SFTPFileSystem
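
For instance, a minimal sketch of using that implementation directly (host, user, pw, and the remote path are placeholders; the keyword arguments are forwarded to paramiko's connect):

import fsspec

# Instantiate fsspec's built-in SFTP file system for the given host.
fs = fsspec.filesystem('sftp', host='host', username='user', password='pw')
with fs.open('/remote/path/file.parquet', 'rb') as f:
    data = f.read()  # or hand the file system to a reader that accepts fsspec objects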

See also Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?



Source: https://stackoverflow.com/questions/56735362/passing-a-paramiko-connection-sftpfile-as-input-to-a-dask-dataframe-read-parquet
