Question
Today I began using the Dask and Paramiko packages, partly as a learning exercise, and partly because I'm beginning a project that will require dealing with large datasets (tens of GB) that can only be accessed from a remote VM (i.e. they cannot be stored locally). I have login credentials and sudo rights for this VM. I have minimal data-analytics experience, and no experience working with datasets larger than a few thousand rows.
The following piece of code belongs to a short helper program that builds a Dask dataframe from a large CSV file hosted on the VM. I want to later pass its output (a reference to the Dask dataframe) to a second function that will perform some overview analysis on it.
import dask.dataframe as dd
import paramiko as pm
import pandas as pd
import sys

def remote_file_to_dask_dataframe(remote_path):
    if isinstance(remote_path, str):
        try:
            client = pm.SSHClient()
            client.load_system_host_keys()
            client.connect('#myserver', username='my_username', password='my_password')
            sftp_client = client.open_sftp()
            remote_file = sftp_client.open(remote_path)
            df = dd.read_csv(remote_file)
            remote_file.close()
            sftp_client.close()
            return df
        except:
            print("An error occurred.")
            sftp_client.close()
            remote_file.close()
    else:
        raise ValueError("Path to remote file as string required")
The code is neither nice nor complete, and in time I will replace the username and password with SSH keys, but that is not the issue here. In a Jupyter notebook, I've previously opened the SFTP connection with a path to a file on the server and read it into a dataframe with a regular Pandas read_csv call. However, here the equivalent line, using Dask, is the source of the problem: df = dd.read_csv(remote_file).
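For context, here is the contrast as I understand it (the path below is a placeholder): Pandas accepts an open file-like object, so the Paramiko SFTP handle works there, whereas Dask's read_csv takes a path/URL string.

# What already worked in the notebook: pandas reads from the open
# SFTP file handle directly (path is a placeholder).
remote_file = sftp_client.open('/path/on/vm/data.csv')
df = pd.read_csv(remote_file)

# The failing Dask equivalent: dd.read_csv wants a path/URL string
# (or a list of them), not an open file object.
df = dd.read_csv(remote_file)   # -> TypeError: url type not understood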
I've looked at the documentation online (here), but I can't tell whether what I'm trying above is possible. It seems that for networked storage Dask wants a URL, and the parameter-passing options for, e.g., S3 appear to depend on that backend's infrastructure. I unfortunately cannot make any sense of the dask-ssh documentation (here).
I've poked around with print statements, and the only line that fails to execute is the one stated above. The error raised is:

raise TypeError('url type not understood: %s' % urlpath)
TypeError: url type not understood:
Can anybody point me in the right direction for achieving what I'm trying to do? I'd expected Dask's read_csv to work the way Pandas' does, since it's built on top of it.
I'd appreciate any help, thanks.
p.s. I'm aware of Pandas' read_csv chunksize option, but I would like to achieve this through Dask, if possible.
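(For comparison, the chunked Pandas approach would look roughly like this, reusing the open SFTP handle from above; the chunk size is arbitrary.)

# Iterate over the remote CSV in ~100k-row pieces instead of loading it whole.
for chunk in pd.read_csv(remote_file, chunksize=100_000):
    ...  # overview analysis on each chunk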
Answer 1:
In the master version of Dask, file-system operations now use fsspec, which, along with the previously available implementations (s3, gcs, hdfs), supports some additional file systems; see the mapping of protocol identifiers in fsspec.registry.known_implementations.
In short, using a URL like "sftp://user:pw@host:port/path" should now work for you, if you install fsspec and Dask from master.
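For example (a sketch only; host, credentials and path are placeholders): fsspec's sftp implementation uses Paramiko under the hood, and keyword arguments given in storage_options are forwarded to SSHClient.connect.

import dask.dataframe as dd

# Credentials embedded in the URL:
df = dd.read_csv('sftp://my_username:my_password@myserver/path/to/file.csv')

# Or kept out of the URL via storage_options:
df = dd.read_csv(
    'sftp://myserver/path/to/file.csv',
    storage_options={'username': 'my_username', 'password': 'my_password'},
)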
Answer 2:
It seems that you would have to implement their "file system" interface.
I'm not sure what the minimal set of methods is that you need to implement to support read_csv, but you definitely have to implement open.
class SftpFileSystem(object):
    def open(self, path, mode='rb', **kwargs):
        return sftp_client.open(path, mode)

dask.bytes.core._filesystems['sftp'] = SftpFileSystem

df = dd.read_csv('sftp://remote/path/file.csv')
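Note that sftp_client is not defined in the snippet above; it is assumed to be an already-open Paramiko SFTP session like the one in your function. A rough sketch of tying the two together (connection details are placeholders, and as said, read_csv may still require methods beyond open):

import dask
import dask.dataframe as dd
import paramiko as pm

class SftpFileSystem(object):
    # Sketch: the filesystem owns its own Paramiko SFTP session.
    def __init__(self, **storage_options):
        client = pm.SSHClient()
        client.load_system_host_keys()
        client.connect('myserver', username='my_username', password='my_password')
        self.sftp_client = client.open_sftp()

    def open(self, path, mode='rb', **kwargs):
        return self.sftp_client.open(path, mode)

dask.bytes.core._filesystems['sftp'] = SftpFileSystem
df = dd.read_csv('sftp://remote/path/file.csv')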
Source: https://stackoverflow.com/questions/56623297/is-it-possible-to-read-a-csv-from-a-remote-server-using-paramiko-and-dasks-re