Reading multiple csv files from S3 bucket with boto3

最后都变了 - Submitted on 2019-12-24 03:49:21

Question


I need to read multiple CSV files from an S3 bucket with boto3 in Python and finally combine them into a single pandas DataFrame.

I am able to read a single file with the following script in Python:

 import boto3

 s3 = boto3.resource('s3')
 bucket = s3.Bucket('test-bucket')
 for obj in bucket.objects.all():
     key = obj.key
     body = obj.get()['Body'].read()

The following is my path:

 files/splittedfiles/Code-345678

Under Code-345678 I have multiple CSV files which I have to read and combine into a single pandas DataFrame.

Also, how do I pass a list of selected Codes, so that only those folders are read? e.g.

files/splittedfiles/Code-345678
files/splittedfiles/Code-345679
files/splittedfiles/Code-345680
files/splittedfiles/Code-345681
files/splittedfiles/Code-345682

From the above, I need to read files under the following codes only:

345678,345679,345682

How can I do it in python?


Answer 1:


The boto3 API does not support reading multiple objects at once, but you can list every object that shares a common prefix and load each one in a loop. To do this, use the filter() method and set the Prefix parameter to the prefix of the objects you want to load. Below is a small change to your code that fetches all objects under the prefix "files/splittedfiles/Code-345678"; looping through them lets you load each file into a DataFrame:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
prefix_objs = bucket.objects.filter(Prefix="files/splittedfiles/Code-345678")
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()

If you have multiple prefixes to evaluate, turn the above into a function that takes the prefix as a parameter, then combine the results. The function could look something like this:

import io

import boto3
import pandas as pd

def read_prefix_to_df(prefix):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('test-bucket')
    prefix_objs = bucket.objects.filter(Prefix=prefix)
    prefix_dfs = []
    for obj in prefix_objs:
        body = obj.get()['Body'].read()
        # Parse the raw CSV bytes; pd.DataFrame(body) would not parse CSV.
        df = pd.read_csv(io.BytesIO(body))
        prefix_dfs.append(df)
    return pd.concat(prefix_dfs)

Then you can iteratively apply this function to each prefix and combine the results in the end.
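To answer the second part of the question (passing a list of selected Codes), one way is to build the full prefixes from the code numbers and read each one. This is a sketch, not the author's code: build_prefixes and read_codes_to_df are hypothetical helper names, and it assumes every object under a code's prefix is a CSV file.

```python
def build_prefixes(codes, base="files/splittedfiles/Code-"):
    # Map each selected code number to its full S3 key prefix.
    return [base + str(code) for code in codes]

def read_codes_to_df(bucket_name, codes):
    # Download every object under each selected prefix and
    # concatenate the CSV contents into one DataFrame.
    # Imports are local so build_prefixes works without boto3/pandas.
    import io

    import boto3
    import pandas as pd

    bucket = boto3.resource("s3").Bucket(bucket_name)
    frames = []
    for prefix in build_prefixes(codes):
        for obj in bucket.objects.filter(Prefix=prefix):
            body = obj.get()["Body"].read()
            frames.append(pd.read_csv(io.BytesIO(body)))
    return pd.concat(frames, ignore_index=True)

# Usage (needs AWS credentials and the real bucket):
# df = read_codes_to_df("test-bucket", [345678, 345679, 345682])
```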




Answer 2:


You can do it like this, using filter() instead of all():

for obj in bucket.objects.filter(Prefix='files/splittedfiles/'):
    key = obj.key
    body = obj.get()['Body'].read()
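Since this prefix covers every Code folder, you would still need to skip objects belonging to codes you don't want. A small predicate (key_matches is a hypothetical name, not from the answer) can do that inside the loop:

```python
def key_matches(key, codes, base="files/splittedfiles/Code-"):
    # True when the object key lives under one of the selected codes.
    return any(key.startswith(base + str(code)) for code in codes)

# Inside the loop from the answer above:
# for obj in bucket.objects.filter(Prefix='files/splittedfiles/'):
#     if not key_matches(obj.key, [345678, 345679, 345682]):
#         continue
#     body = obj.get()['Body'].read()
```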


Source: https://stackoverflow.com/questions/52855221/reading-multiple-csv-files-from-s3-bucket-with-boto3
