Today I began using the Dask and Paramiko packages, partly as a learning exercise and partly because I'm beginning a project that will require dealing with large datasets (tens of GB) that must be accessed from a remote VM only (i.e. they cannot be stored locally). I have login credentials and sudo rights on this VM. I have minimal data-analytics experience, and no experience working with datasets larger than a few thousand rows.
The following piece of code belongs to a short helper program that builds a Dask dataframe from a large CSV file hosted on the VM. I want to later pass its output (a reference to the Dask dataframe) to a second function that will perform some overview analysis on it.
import dask.dataframe as dd
import paramiko as pm
import pandas as pd
import sys

def remote_file_to_dask_dataframe(remote_path):
    if isinstance(remote_path, str):
        try:
            client = pm.SSHClient()
            client.load_system_host_keys()
            client.connect('#myserver', username='my_username', password='my_password')
            sftp_client = client.open_sftp()
            remote_file = sftp_client.open(remote_path)
            df = dd.read_csv(remote_file)
            remote_file.close()
            sftp_client.close()
            return df
        except:
            print("An error occurred.")
            sftp_client.close()
            remote_file.close()
    else:
        raise ValueError("Path to remote file as string required")
The code is neither nice nor complete, and I will replace the username and password with SSH keys in time, but that is not the issue. In a Jupyter notebook, I've previously opened the sftp connection with a path to a file on the server and read it into a dataframe with a regular Pandas read_csv call. Here, however, the equivalent line using Dask is the source of the problem: df = dd.read_csv(remote_file).
I've looked at the documentation online (here), but I can't tell whether what I'm trying above is possible. It seems that for networked options, Dask wants a URL. The parameter-passing options for, e.g., S3 appear to depend on that infrastructure's backend. Unfortunately, I cannot make any sense of the dask-ssh documentation (here).
I've poked around with print statements, and the only line that fails to execute is the one stated. The error raised is: raise TypeError('url type not understood: %s' % urlpath) TypeError: url type not understood:
Can anybody point me in the right direction for achieving what I'm trying to do? I'd expected Dask's read_csv to function as Pandas' does, as it's based on the same.
I'd appreciate any help, thanks.
P.S. I'm aware of Pandas' read_csv chunksize option, but I would like to achieve this through Dask, if possible; a sketch of that fallback is below for reference.
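For reference, the chunked Pandas approach I'm trying to avoid would look roughly like this sketch, reusing the sftp_client from the function above with a placeholder remote path:

import pandas as pd

# Sketch of the pandas-only fallback: stream the remote CSV in fixed-size chunks.
# sftp_client is the paramiko SFTP client opened as in the function above;
# the path is a placeholder.
with sftp_client.open('/data/on/vm/large_file.csv') as remote_file:
    for chunk in pd.read_csv(remote_file, chunksize=100_000):
        print(chunk.shape)  # replace with the actual per-chunk analysis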
In the master version of Dask, file-system operations now use fsspec, which, along with the previous implementations (s3, gcs, hdfs), supports some additional file systems; see the mapping of protocol identifiers in fsspec.registry.known_implementations.
In short, using a url like "sftp://user:pw@host:port/path" should now work for you, if you install fsspec and Dask from master.
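A minimal sketch of what that could look like, with placeholder host, credentials, and path; the storage_options keywords are forwarded to fsspec's SFTP filesystem (and on to paramiko's connect), so they can be passed separately instead of being embedded in the URL:

import dask.dataframe as dd

# Sketch: read the remote CSV through fsspec's sftp protocol.
# Host, credentials, and path are placeholders; storage_options is passed
# through to the SFTP filesystem backend.
df = dd.read_csv(
    'sftp://myserver/path/to/large_file.csv',
    storage_options={'username': 'my_username', 'password': 'my_password'},
)
print(df.head())  # forces a real read of the first partition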
It seems that you would have to implement their "file system" interface. I'm not sure what the minimal set of methods is that you need to implement to allow read_csv, but you definitely have to implement open.
class SftpFileSystem(object):
    def open(self, path, mode='rb', **kwargs):
        # Delegate to an already-connected paramiko SFTP client
        return sftp_client.open(path, mode)

# Register the custom filesystem for the 'sftp' protocol, then read via URL
dask.bytes.core._filesystems['sftp'] = SftpFileSystem
df = dd.read_csv('sftp://remote/path/file.csv')
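The sftp_client referenced inside that class is assumed to be a module-level paramiko SFTP client set up beforehand, e.g. along the lines of the question's own connection code (host and credentials are placeholders):

import paramiko as pm

# Placeholder connection; SftpFileSystem above closes over sftp_client.
client = pm.SSHClient()
client.load_system_host_keys()
client.connect('myserver', username='my_username', password='my_password')
sftp_client = client.open_sftp()

Note that dask.bytes.core._filesystems is a private registry; in newer Dask versions filesystem handling has moved to fsspec, so the sftp:// URL approach in the other answer is likely the more durable option.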
Source: https://stackoverflow.com/questions/56623297/is-it-possible-to-read-a-csv-from-a-remote-server-using-paramiko-and-dasks-re