Read a file line by line from S3 using boto?

刺人心 2020-11-29 07:32

I have a CSV file in S3 and I'm trying to read the header line to get the size (these files are created by our users so they could be almost any size). Is there a way to do this using boto?

10 Answers
  • 2020-11-29 08:05

    I know it's a very old question.

    But as of now, we can just use s3_conn.get_object(Bucket=bucket, Key=key)['Body'].iter_lines()
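
    For example, a minimal sketch using iter_lines() (the bucket and key names here are placeholders, and the object is assumed to be UTF-8 text):

        import boto3

        s3_conn = boto3.client('s3')
        response = s3_conn.get_object(Bucket='my-bucket', Key='users.csv')

        # iter_lines() yields one line at a time as bytes,
        # without downloading the whole object first
        for line in response['Body'].iter_lines():
            print(line.decode('utf-8'))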

  • 2020-11-29 08:06

    Using boto3:

        import boto3

        s3 = boto3.resource('s3')
        obj = s3.Object(BUCKET, key)
        # note: _raw_stream is the underlying urllib3 response and a
        # private attribute; the public iter_lines() is a safer choice
        for line in obj.get()['Body']._raw_stream:
            print(line)  # do something with line
  • 2020-11-29 08:10

    If you want to read multiple files (line by line) under a specific bucket prefix (i.e., in a "subfolder"), you can do this:

        import boto3

        # hardcoded credentials are shown for completeness; prefer IAM roles
        # or environment-based credentials in real code
        s3 = boto3.resource('s3', aws_access_key_id='<key_id>', aws_secret_access_key='<access_key>')
        bucket = s3.Bucket('<bucket_name>')
        for obj in bucket.objects.filter(Prefix='<your prefix>'):
            for line in obj.get()['Body'].read().splitlines():
                print(line.decode('utf-8'))

    Here each line comes back as bytes, so I decode it to a string. Note that .read() loads each whole object into memory before splitting, so this approach is simple but not truly streaming.
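
    If the files can be large, here is a sketch of the same prefix loop that streams with iter_lines() instead of buffering with read() (bucket and prefix names are placeholders):

        import boto3

        s3 = boto3.resource('s3')
        bucket = s3.Bucket('<bucket_name>')
        for obj in bucket.objects.filter(Prefix='<your prefix>'):
            # iter_lines() streams the body line by line as bytes
            for line in obj.get()['Body'].iter_lines():
                print(line.decode('utf-8'))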

  • 2020-11-29 08:12

    Here's a solution which actually streams the data line by line:

        import boto3
        from gzip import GzipFile
        from io import TextIOWrapper

        s3 = boto3.client('s3')

        # get_object returns a botocore.response.StreamingBody under 'Body'
        response = s3.get_object(Bucket=bucket, Key=key)
        # if the object is gzipped, wrap the stream so it is
        # decompressed on the fly
        gzipped = GzipFile(None, 'rb', fileobj=response['Body'])
        data = TextIOWrapper(gzipped)

        for line in data:
            print(line, end='')  # process line
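    For a plain-text (non-gzipped) object, a sketch that skips the GzipFile layer (this assumes a botocore version where StreamingBody is file-like; recent releases subclass io.IOBase):

        # wrap the body directly for text decoding
        data = TextIOWrapper(response['Body'], encoding='utf-8')
        for line in data:
            print(line, end='')  # process line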