Reading a file from a private S3 bucket to a pandas dataframe

猫巷女王i 2020-12-08 10:19

I'm trying to read a CSV file from a private S3 bucket into a pandas dataframe:

df = pandas.read_csv('s3://mybucket/file.csv')

I can read a file from a public bucket, but reading a file from a private bucket results in an HTTP 403: Forbidden error.
8 Answers
  • 2020-12-08 10:26

    Updated for pandas 0.20.1

    pandas now uses s3fs for handling S3 connections. This shouldn't break any existing code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.

    import os
    
    import pandas as pd
    from s3fs.core import S3FileSystem
    
    # aws keys stored in ini file in same path
    # refer to boto3 docs for config settings
    os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'
    
    s3 = S3FileSystem(anon=False)
    key = 'path/to/your-csv.csv'
    bucket = 'your-bucket-name'
    
    df = pd.read_csv(s3.open('{}/{}'.format(bucket, key),
                             mode='rb')
                     )
    
  • 2020-12-08 10:26

    Update for pandas 0.20.3 without using s3fs:

    import boto3
    import pandas as pd
    import sys
    
    if sys.version_info[0] < 3: 
        from StringIO import StringIO # Python 2.x
    else:
        from io import StringIO # Python 3.x
    
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='bucket', Key='key')
    body = obj['Body']
    csv_string = body.read().decode('utf-8')
    
    df = pd.read_csv(StringIO(csv_string))
    
  • 2020-12-08 10:27

    Note that if your bucket is private and hosted on an AWS-compatible provider, you will run into errors, since s3fs does not load the profile config file at ~/.aws/config the way awscli does.

    One solution is to define the appropriate environment variables:

    export AWS_S3_ENDPOINT="myEndpoint"
    export AWS_DEFAULT_REGION="MyRegion"
    
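    If you prefer to configure this in code rather than through the shell, a minimal sketch could pass the endpoint straight to s3fs via its client_kwargs (assumptions: s3fs is installed, and the endpoint/region values below are hypothetical placeholders for your provider's):

```python
import os

# Hypothetical placeholder values -- substitute your provider's endpoint/region.
os.environ['AWS_S3_ENDPOINT'] = 'https://s3.my-provider.example'
os.environ['AWS_DEFAULT_REGION'] = 'my-region-1'

# s3fs forwards these to botocore when creating the S3 client:
client_kwargs = {
    'endpoint_url': os.environ['AWS_S3_ENDPOINT'],
    'region_name': os.environ['AWS_DEFAULT_REGION'],
}

# Uncomment once s3fs is installed:
# from s3fs import S3FileSystem
# s3 = S3FileSystem(anon=False, client_kwargs=client_kwargs)
```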
  • 2020-12-08 10:30

    Based on this answer, I found smart_open to be much simpler to use:

    import pandas as pd
    from smart_open import smart_open
    
    initial_df = pd.read_csv(smart_open('s3://bucket/file.csv'))
    
  • 2020-12-08 10:37

    Pandas uses boto (not boto3) inside read_csv. You might be able to install boto and have it work correctly.

    There are some issues with boto on Python 3.4.4 / 3.5.1. If you're on those versions, you can use boto3 instead until they are fixed:

    import boto3
    import pandas as pd
    
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='bucket', Key='key')
    df = pd.read_csv(obj['Body'])
    

    The obj['Body'] value is a streaming file-like object with a .read method (returning bytes), which is all pandas needs.
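    The point about the .read method can be verified locally: pd.read_csv accepts any file-like object, so an in-memory BytesIO can stand in for the S3 streaming body in a self-contained sketch (no AWS access needed):

```python
import io
import pandas as pd

# pd.read_csv only needs an object with a .read method; boto3's
# StreamingBody satisfies this. A local BytesIO stands in for it here:
fake_body = io.BytesIO(b'a,b\n1,2\n3,4\n')
df = pd.read_csv(fake_body)
# df now has columns ['a', 'b'] and two rows
```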

  • 2020-12-08 10:38

    Update for pandas 0.22 and up:

    If you have already installed s3fs (pip install s3fs), you can read the file directly from the S3 path, without any extra imports:

    data = pd.read_csv('s3://bucket....csv')
    

    stable docs
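    In much newer pandas (1.0 and up), credentials can also be passed inline through the storage_options parameter rather than via environment configuration. A hedged sketch (the key names follow s3fs's constructor arguments; the credential values are hypothetical placeholders):

```python
# Placeholder credentials -- never hard-code real keys in source.
storage_options = {
    'key': 'YOUR_ACCESS_KEY_ID',
    'secret': 'YOUR_SECRET_ACCESS_KEY',
}

# Requires pandas >= 1.0 and s3fs installed:
# data = pd.read_csv('s3://your-bucket/file.csv', storage_options=storage_options)
```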
