Question
I tried passing a paramiko.sftp_file.SFTPFile object instead of a file URL to pandas.read_parquet, and it worked fine. But when I tried the same with Dask, it threw an error. Below is the code I ran and the error I get. How can I make this work?
import dask.dataframe as dd
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(host)  # connect before opening the SFTP session
sftp_client = ssh.open_sftp()

source_file = sftp_client.open(str(parquet_file), 'rb')
full_df = dd.read_parquet(source_file, engine='pyarrow')
print(len(full_df))
Traceback (most recent call last):
File "C:\Users\rrrrr\Documents\jackets_dask.py", line 22, in <module>
full_df = dd.read_parquet(source_file,engine='pyarrow')
File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\dataframe\io\parquet.py", line 1173, in read_parquet
storage_options=storage_options
File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\bytes\core.py", line 368, in get_fs_token_paths
raise TypeError('url type not understood: %s' % urlpath)
TypeError: url type not understood: <paramiko.sftp_file.SFTPFile object at 0x0000007712D9A208>
Answer 1:
The situation has changed, and you can now do this directly with Dask. Answer pasted from Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?
In the master version of Dask, file-system operations now use fsspec, which, along with the previous implementations (s3, gcs, hdfs), supports some additional file-systems; see the mapping of protocol identifiers in fsspec.registry.known_implementations.
In short, a URL like "sftp://user:pw@host:port/path" should now work for you, if you install fsspec and Dask from master.
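For instance, a minimal sketch of that URL-based approach (host, credentials, and path below are placeholders, and the actual reads are commented out since they need a reachable SFTP server):

```python
# Placeholder connection details -- substitute your own.
user, pw, host, port = "user", "pw", "example.com", 22
url = f"sftp://{user}:{pw}@{host}:{port}/remote/path/file.parquet"

# With fsspec and a recent Dask installed, the URL can be passed directly:
# import dask.dataframe as dd
# df = dd.read_parquet(url, engine="pyarrow")

# Credentials can also be kept out of the URL via storage_options:
# df = dd.read_parquet(f"sftp://{host}/remote/path/file.parquet",
#                      engine="pyarrow",
#                      storage_options={"username": user, "password": pw, "port": port})
print(url)
```

The storage_options dict is forwarded to the sftp filesystem implementation, so credentials need not appear in the URL itself.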
Answer 2:
Dask does not support file-like objects directly. You would have to implement their "file system" interface. I'm not sure what the minimal set of methods is that you need to implement for read_parquet, but you definitely have to implement open. Something like this:
class SftpFileSystem(object):
    def open(self, path, mode='rb', **kwargs):
        return sftp_client.open(path, mode)

dask.bytes.core._filesystems['sftp'] = SftpFileSystem

df = dd.read_parquet('sftp://remote/path/file', engine='pyarrow')
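To illustrate the shape of that interface without a live server, here is a self-contained sketch in which a fake client stands in for paramiko's SFTPClient (all names are illustrative, not Dask's actual API):

```python
import io

class FakeSftpClient:
    """Stand-in for paramiko's SFTPClient, backed by a dict of bytes."""
    def __init__(self, files):
        self.files = files

    def open(self, path, mode="rb"):
        # Return a file-like object, as SFTPClient.open() does.
        return io.BytesIO(self.files[path])

class SftpFileSystem:
    """Minimal 'file system' wrapper of the kind sketched above:
    for reading, open() is the essential method."""
    def __init__(self, client):
        self.client = client

    def open(self, path, mode="rb", **kwargs):
        return self.client.open(path, mode)

# Parquet files start with the magic bytes b"PAR1".
client = FakeSftpClient({"/remote/path/file.parquet": b"PAR1...PAR1"})
fs = SftpFileSystem(client)
with fs.open("/remote/path/file.parquet") as f:
    header = f.read(4)
print(header)
```

The point is only that Dask resolves the URL's protocol to a filesystem object and calls its open() to get a file-like handle, rather than accepting the handle itself.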
There's actually an implementation of such a file system for SFTP in the fsspec library:
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.sftp.SFTPFileSystem
See also Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?
Source: https://stackoverflow.com/questions/56735362/passing-a-paramiko-connection-sftpfile-as-input-to-a-dask-dataframe-read-parquet