I've just started to experiment with AWS SageMaker and would like to load data from an S3 bucket into a pandas dataframe in my SageMaker Python Jupyter notebook for analysis.
Make sure the Amazon SageMaker execution role has a policy attached that grants access to S3. This can be done in IAM.
import boto3
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()  # the notebook's execution role; it must be allowed to read the bucket

bucket = 'my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

# pandas can read s3:// paths directly (requires s3fs to be installed)
df = pd.read_csv(data_location)
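As a side note, the IAM step mentioned above can also be done programmatically. A minimal sketch using boto3, attaching the AWS managed AmazonS3ReadOnlyAccess policy (the role name below is a placeholder):

import boto3

iam = boto3.client('iam')

# attach the AWS managed read-only S3 policy to the notebook's execution role
iam.attach_role_policy(
    RoleName='my-sagemaker-execution-role',  # placeholder: use your role's actual name
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
)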
You could also access your bucket as a file system using s3fs:
import s3fs
from PIL import Image  # needed for Image.open below

fs = s3fs.S3FileSystem()

# list the first 5 objects under a prefix in your accessible bucket
fs.ls('s3://bucket-name/data/')[:5]

# open a file directly (display() is available in Jupyter)
with fs.open('s3://bucket-name/data/image.png') as f:
    display(Image.open(f))
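The same file-system handle works for tabular data too. A short sketch reading a CSV through s3fs into pandas (the key data/train.csv is a placeholder):

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()

# hand the open file object straight to pandas
with fs.open('s3://bucket-name/data/train.csv') as f:
    df = pd.read_csv(f)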
This code sample imports a CSV file from S3; it was tested in a SageMaker notebook.
Use pip or conda to install s3fs first: !pip install s3fs
import boto3  # AWS Python SDK
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

my_bucket = ''          # declare bucket name
my_file = 'aa/bb.csv'   # declare file path

data_location = 's3://{}/{}'.format(my_bucket, my_file)
data = pd.read_csv(data_location)
data.head(2)
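Since boto3 is imported anyway, you can also skip s3fs entirely and stream the object through boto3. A minimal sketch (the bucket and key are placeholders as above):

import io

import boto3
import pandas as pd

s3 = boto3.client('s3')

# fetch the object and parse the byte stream with pandas
obj = s3.get_object(Bucket='my-bucket', Key='aa/bb.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))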
If you have a look here, it seems you can specify this in the InputDataConfig. Search for "S3DataSource" (ref) in the document; the first hit is even in Python, on page 25/26.
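For context, a minimal sketch of what such an InputDataConfig entry can look like when calling create_training_job via boto3 (bucket name, prefix, and channel name are illustrative placeholders):

import boto3

sagemaker_client = boto3.client('sagemaker')

# one input channel pointing at an S3 prefix
input_data_config = [{
    'ChannelName': 'train',
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://my-bucket/train/',
            'S3DataDistributionType': 'FullyReplicated',
        }
    },
    'ContentType': 'text/csv',
}]
# pass this list as InputDataConfig=input_data_config to create_training_job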
You can also use AWS Data Wrangler https://github.com/awslabs/aws-data-wrangler (install with !pip install awswrangler). In recent releases the pandas reader lives under the s3 module:

import awswrangler as wr
df = wr.s3.read_csv(path="s3://...")

(Older 0.x releases used wr.pandas.read_csv instead.)
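Data Wrangler can also write a dataframe back to S3. A minimal sketch, assuming a hypothetical destination key:

import awswrangler as wr

# write the dataframe back as a single CSV object (path is a placeholder)
wr.s3.to_csv(df=df, path='s3://my-bucket/output/result.csv')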