Triggering AWS Lambda on arrival of new files in AWS S3

风格不统一 · Submitted 2019-12-11 18:54:40

Question


I have a Lambda function written in Python that runs Redshift COPY commands for 3 tables from 3 files located in AWS S3.

Example:

I have table A, B and C.

The python code contains:

'copy to redshift A from "s3://bucket/abc/A.csv"'
'copy to redshift B from "s3://bucket/abc/B.csv"'
'copy to redshift C from "s3://bucket/abc/C.csv"'

This code is triggered whenever a new version of any of the three files arrives at the "s3://bucket/abc/" location in S3, so it loads all three tables even if only one CSV file has arrived.

Best-case solution: break the code into three different Lambda functions and map each one directly to its source file's upload/update.

But my requirement is to go ahead with a single Lambda function that selectively runs only the part (using if blocks) for the CSV files that were actually updated.

Example:

if (new csv file for A has arrived):
    'copy to redshift A from "s3://bucket/abc/A.csv"'
if (new csv file for B has arrived):
    'copy to redshift B from "s3://bucket/abc/B.csv"'
if (new csv file for C has arrived):
    'copy to redshift C from "s3://bucket/abc/C.csv"'

Currently, to achieve this, I am storing those files' metadata (LastModified) in a Python dict keyed by file name. Printing the dict gives something like this:

{'bucket/abc/A.csv': '2019-04-17 11:14:11+00:00', 'bucket/abc/B.csv': '2019-04-18 12:55:47+00:00', 'bucket/abc/C.csv': '2019-04-17 11:09:55+00:00'}

Then, whenever a new file appears among any of the three, Lambda is triggered; I read the dict and compare each file's LastModified time with the stored value, and if the new LastModified is greater, I run that table's copy command.
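The comparison step described above can be sketched roughly like this (a minimal sketch; how the dict is persisted between invocations, e.g. in S3 or DynamoDB, is left out, and `needs_copy` is a hypothetical helper name):

```python
from datetime import datetime, timezone

def needs_copy(last_seen, key, new_time):
    """Return True if the file at `key` is newer than the stored timestamp."""
    previous = last_seen.get(key)
    return previous is None or new_time > previous

# The persisted dict of {object key: LastModified}, as in the question.
last_seen = {
    'bucket/abc/A.csv': datetime(2019, 4, 17, 11, 14, 11, tzinfo=timezone.utc),
    'bucket/abc/B.csv': datetime(2019, 4, 18, 12, 55, 47, tzinfo=timezone.utc),
}

# A newer upload of A.csv should trigger its COPY command.
print(needs_copy(last_seen, 'bucket/abc/A.csv',
                 datetime(2019, 4, 18, 9, 0, 0, tzinfo=timezone.utc)))  # True
```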

All this, because there is no workaround I could find with S3 events/CloudWatch for this kind of use case.

Please ask further questions if I haven't articulated the problem well.


Answer 1:


When an Amazon S3 Event triggers an AWS Lambda function, it provides the Bucket name and Object key as part of the event:

import urllib.parse

def lambda_handler(event, context):
    # Get the bucket and object key from the event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

While the object details are passed as a list, I suspect that each event only ever contains one object (hence the use of [0]). However, I'm not 100% certain that this will always be the case; best to assume it until proven otherwise.

Thus, if your code is expecting specific objects, it would look like:

if key == 'abc/A.csv':
    'copy to Table-A from "s3://bucket/abc/A.csv"'
if key == 'abc/B.csv':
    'copy to Table-B from "s3://bucket/abc/B.csv"'
if key == 'abc/C.csv':
    'copy to Table-C from "s3://bucket/abc/C.csv"'

There is no need to store LastModified, since the event is triggered whenever a new file is uploaded. Also, be careful about storing data in a global dict and expecting it to be around at a future execution — this will not always be the case. A Lambda container can be removed if it does not run for a period of time, and additional Lambda containers might be created if there is concurrent execution.

If you always know that you are expecting 3 files and they are always uploaded in a certain order, then you could instead use the upload of the 3rd file to trigger the process, which would then copy all 3 files to Redshift.
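If that ordering assumption holds, the handler collapses to a single check (a sketch; `TRIGGER_KEY`, the table names, and the order C.csv-arrives-last are all assumptions, and the inner loop is a placeholder for the three real COPY commands):

```python
import urllib.parse

TRIGGER_KEY = 'abc/C.csv'  # assumed to always be the last file uploaded
TABLE_TO_KEY = {
    'table_a': 'abc/A.csv',
    'table_b': 'abc/B.csv',
    'table_c': 'abc/C.csv',
}

def lambda_handler(event, context):
    copied = []
    for record in event['Records']:
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        if key != TRIGGER_KEY:
            continue  # ignore the earlier uploads
        # The last file has arrived, so all three are present: load them all.
        for table, file_key in TABLE_TO_KEY.items():
            # Placeholder for the actual COPY of file_key into table
            copied.append(table)
    return copied
```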



Source: https://stackoverflow.com/questions/55794233/triggering-aws-lambda-on-arrival-of-new-files-in-aws-s3
