Transfer from ADLS2 to Compute Target very slow Azure Machine Learning

Submitted by 岁酱吖の on 2020-07-09 13:20:12

Question


During a training script executed on a compute target, we're trying to download a registered Dataset from an ADLS2 Datastore. The problem is that it takes hours to download ~1.5 GB (split into ~8500 files) to the compute target with the following method:

from azureml.core import Datastore, Dataset, Run, Workspace

# Retrieve the run context to get Workspace
RUN = Run.get_context(allow_offline=True)

# Retrieve the workspace
ws = RUN.experiment.workspace

# Creating the Dataset object based on a registered Dataset
dataset = Dataset.get_by_name(ws, name='my_dataset_registered')

# Download the Dataset locally
dataset.download(target_path='/tmp/data', overwrite=False)

Important note: the Dataset is registered to a path in the Data Lake that contains many subfolders (as well as sub-subfolders, and so on) holding small files of around 170 KB each.

Note: I'm able to download the complete dataset to my local computer within a few minutes using AzCopy or Storage Explorer. Also, the Dataset is defined at folder level, with the ** wildcard for scanning subfolders: datalake/relative/path/to/folder/**

Is this a known issue? How can I improve the transfer speed?

Thanks!


Answer 1:


Edited to be more answer-like:

It'd be helpful to include: which versions of the azureml-core and azureml-dataprep SDKs you are using, what type of VM you are running as the compute instance, and what types of files (e.g. jpg? txt?) your dataset contains. Also, what are you trying to achieve by downloading the complete dataset to your compute?
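To answer the file-type question quickly, something like the sketch below could be run against whatever part of the dataset has already been downloaded. It only uses the Python standard library; the function name summarize_files is made up for illustration, and /tmp/data is simply the target_path from the question.

```python
import os
from collections import Counter


def summarize_files(root):
    """Walk `root` and return (file_count, total_bytes, Counter of extensions)."""
    extensions = Counter()
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            file_count += 1
            total_bytes += os.path.getsize(path)
            # Files without an extension are grouped under "<none>".
            ext = os.path.splitext(name)[1].lower() or "<none>"
            extensions[ext] += 1
    return file_count, total_bytes, extensions


if __name__ == "__main__":
    # Assumption: the dataset was downloaded to /tmp/data as in the question.
    count, size, exts = summarize_files("/tmp/data")
    average = size / count if count else 0
    print(f"{count} files, {size} bytes in total, {average:.0f} bytes on average")
    for ext, n in exts.most_common():
        print(f"{ext}: {n}")
```

The file count and average size are worth including in a report, since a large number of small files is exactly the situation described in the question.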

Currently, the compute instance image comes with azureml-core 1.0.83 and azureml-dataprep 1.1.35 pre-installed, which are 1-2 months old. You might be using even older versions. You can try upgrading by running the following in your notebook:

%pip install -U azureml-sdk
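To confirm which versions are actually installed, before and after the upgrade, a minimal sketch using only the standard library (Python 3.8+) might look like this. The helper name installed_versions is hypothetical; the two package names are the ones mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("azureml-core", "azureml-dataprep")):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

Run it inside the training script or notebook kernel itself, since the environment on the compute target may differ from your local one.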

If you don't see any improvement in your scenario, you can file an issue on the official docs page, such as the reference page for FileDataset, to get someone to help debug it.

(edited on June 9, 2020 to remove mention of experimental release because that is not happening anymore)




Answer 2:


DataTransferStep creates an Azure ML Pipeline step that transfers data between storage options.

See the reference page for the DataTransferStep class: https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.data_transfer_step.datatransferstep?view=azure-ml-py



Source: https://stackoverflow.com/questions/60562966/transfer-from-adls2-to-compute-target-very-slow-azure-machine-learning
