How best to convert from azure blob csv format to pandas dataframe while running notebook in azure ml

為{幸葍}努か 提交于 2019-11-30 04:53:41

问题


I have a number of large csv (tab delimited) data stored as azure blobs, and I want to create a pandas dataframe from these. I can do this locally as follows:

from azure.storage.blob import BlobService
import pandas as pd
import os.path

STORAGEACCOUNTNAME= 'account_name'
STORAGEACCOUNTKEY= "key"
LOCALFILENAME= 'path/to.csv'        
CONTAINERNAME= 'container_name'
BLOBNAME= 'bloby_data/000000_0'

blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)

# Only get a local copy if haven't already got it
if not os.path.isfile(LOCALFILENAME):
    blob_service.get_blob_to_path(CONTAINERNAME,BLOBNAME,LOCALFILENAME)

df_customer = pd.read_csv(LOCALFILENAME, sep='\t')

However, when running the notebook on azure ML notebooks, I can't 'save a local copy' and then read from csv, and so I'd like to do the conversion directly (something like pd.read_azure_blob(blob_csv) or just pd.read_csv(blob_csv) would be ideal).

I can get to the desired end result (pandas dataframe for blob csv data), if I first create an azure ML workspace, and then read the datasets into that, and finally using https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python to access the dataset as a pandas dataframe, but I'd prefer to just read straight from the blob storage location.


回答1:


I think you want to use get_blob_to_bytes, or get_blob_to_text; these should output a string which you can use to create a dataframe as

from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
df = pd.read_csv(StringIO(blobstring))



回答2:


Thanks for the answer, I think some correction is needed. You need to get content from the blob object and in the get_blob_to_text there's no need for the local file name.

from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))


来源:https://stackoverflow.com/questions/33091830/how-best-to-convert-from-azure-blob-csv-format-to-pandas-dataframe-while-running

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!