Question
I have an S3 bucket named "Source". Many '.tgz' files are being pushed into that bucket in real time. I wrote Java code that extracts each '.tgz' file and pushes the contents into a "Destination" bucket, and deployed it as a Lambda function. I get the '.tgz' file as an InputStream in my Java code. How can I extract it in Lambda? I'm not able to create a file in Lambda; it throws "FileNotFound (Permission denied)" in Java.
AmazonS3 s3Client = new AmazonS3Client();
S3Object s3Object = s3Client.getObject(new GetObjectRequest(srcBucket, srcKey));
InputStream objectData = s3Object.getObjectContent();
File file = new File(s3Object.getKey());
OutputStream writer = new BufferedOutputStream(new FileOutputStream(file)); // <-- throws FileNotFoundException (Permission denied) here
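// Note: the Lambda filesystem is writable only under /tmp, which is why the
// FileOutputStream above fails; a path such as new File("/tmp", s3Object.getKey())
// would avoid the permission error (assuming the key maps to a flat /tmp layout).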
Answer 1:
Since one of the responses was in Python, I provide an alternative solution in this language.
The problem with the solution using the /tmp file system is that AWS allows only 512 MB of storage there (read more). To untar or unzip larger files, it's better to use the io package and the BytesIO class and process the file contents purely in memory. AWS allows up to 3 GB of RAM to be assigned to a Lambda, which raises the maximum file size significantly. I successfully tested untarring a 1 GB file from S3.
In my case, untarring ~2000 files from a 1 GB tar file into another S3 bucket took 140 seconds. It can be optimized further by using multiple threads to upload the untarred files to the target S3 bucket (see the sketch after the code below).
The example code below presents the single-threaded solution:
import boto3
import botocore
import tarfile
from io import BytesIO

s3_client = boto3.client('s3')

def untar_s3_file(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    input_tar_file = s3_client.get_object(Bucket=bucket, Key=key)
    input_tar_content = input_tar_file['Body'].read()
    with tarfile.open(fileobj=BytesIO(input_tar_content)) as tar:
        for tar_resource in tar:
            if tar_resource.isfile():
                inner_file_bytes = tar.extractfile(tar_resource).read()
                s3_client.upload_fileobj(BytesIO(inner_file_bytes), Bucket=bucket, Key=tar_resource.name)
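A rough sketch of the multi-threaded upload optimization mentioned above (the handler name and the max_workers value are illustrative, not from the original answer): the tar members are still read sequentially, because TarFile is not thread-safe, and only the S3 uploads run in parallel. boto3 clients are safe to share across threads.
import boto3
import tarfile
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

s3_client = boto3.client('s3')

def untar_s3_file_threaded(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    body = s3_client.get_object(Bucket=bucket, Key=key)['Body'].read()
    with tarfile.open(fileobj=BytesIO(body)) as tar, \
            ThreadPoolExecutor(max_workers=8) as pool:
        for member in tar:
            if member.isfile():
                # Read on the main thread; TarFile is not thread-safe.
                data = tar.extractfile(member).read()
                # Only the uploads run concurrently.
                pool.submit(s3_client.upload_fileobj, BytesIO(data),
                            bucket, member.name)
    # Leaving the executor context waits for all pending uploads to finish.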
Answer 2:
import boto3
import botocore
import tarfile

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    new_bucket = 'uncompressed-data'  # target bucket name
    new_key = key[:-4]                # the key minus its '.tgz' suffix
    try:
        s3_client.download_file(bucket, key, '/tmp/file')
        if tarfile.is_tarfile('/tmp/file'):
            tar = tarfile.open('/tmp/file', "r:gz")
            for member in tar:
                if member.isfile():
                    tar.extract(member, path='/tmp/extract/')
                    # Give each extracted file its own key so uploads don't overwrite each other.
                    s3_client.upload_file('/tmp/extract/' + member.name,
                                          new_bucket, new_key + '/' + member.name)
            tar.close()
    except Exception as e:
        print(e)
        raise e
Use Python 3.6 and trigger the function on an ObjectCreated (All) event with the suffix ".tgz" (a sketch of that trigger configuration follows). Hope this helps you. Check out this Link
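For illustration, the trigger can also be attached programmatically; the sketch below uses boto3's put_bucket_notification_configuration with a placeholder function ARN and bucket name, and assumes the Lambda's resource policy already allows s3.amazonaws.com to invoke it:
import boto3

s3_client = boto3.client('s3')

# Placeholder values; substitute your own function ARN and source bucket.
LAMBDA_ARN = 'arn:aws:lambda:us-east-1:123456789012:function:untar-tgz'
SOURCE_BUCKET = 'source-bucket'

# Invoke the Lambda for every new object whose key ends in '.tgz'.
s3_client.put_bucket_notification_configuration(
    Bucket=SOURCE_BUCKET,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': LAMBDA_ARN,
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'suffix', 'Value': '.tgz'}
            ]}}
        }]
    }
)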
Answer 3:
Don't use a File or FileOutputStream; use s3Client.putObject(). To read the tgz file, you can use Apache Commons Compress. Example:
import java.io.ByteArrayInputStream;
import java.util.zip.GZIPInputStream;
import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.ArchiveInputStream;
import org.apache.commons.compress.archivers.ArchiveStreamFactory;
import org.apache.commons.compress.utils.IOUtils;
import com.amazonaws.services.s3.model.ObjectMetadata;

ArchiveInputStream tar = new ArchiveStreamFactory()
        .createArchiveInputStream("tar", new GZIPInputStream(objectData));
ArchiveEntry entry;
while ((entry = tar.getNextEntry()) != null) {
    if (!entry.isDirectory()) {
        // getSize() returns a long; entries over 2 GB would need streaming instead.
        byte[] objectBytes = new byte[(int) entry.getSize()];
        // A single read() may fill only part of the buffer, so read until full.
        IOUtils.readFully(tar, objectBytes);
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(objectBytes.length);
        metadata.setContentType("application/octet-stream");
        s3Client.putObject(destBucket, entry.getName(),
                new ByteArrayInputStream(objectBytes), metadata);
    }
}
Source: https://stackoverflow.com/questions/35226804/aws-lambda-how-to-extract-a-tgz-file-in-a-s3-bucket-and-put-it-in-another-s3-bu