Question
I am trying to run a COPY command that loads around 100 GB of data from S3 to Redshift. I am using an AWS Lambda function to initiate this COPY command every day. This is my current code:
from datetime import datetime, timedelta
import dateutil.tz
import psycopg2
from config import *

def lambda_handler(event, context):
    # Connect to the Redshift cluster (connection settings come from config).
    con = psycopg2.connect(dbname=dbname, user=user, password=password,
                           host=host, port=port)
    cur = con.cursor()
    try:
        # Run the COPY inside an explicit transaction.
        query = """BEGIN TRANSACTION;
        COPY """ + table_name + """ FROM '""" + intermediate_path + """' iam_role '""" + iam_role + """' FORMAT AS parquet;
        END TRANSACTION;"""
        print(query)
        cur.execute(query)
    except Exception as e:
        # Log the failure with a date-stamped subject for later alerting.
        subject = "Error emr copy: {}".format(str(datetime.now().date()))
        body = "Exception occurred " + str(e)
        print(body)
    con.close()
This function runs fine, but there is one problem: when the Lambda function hits its 15-minute timeout, the COPY command also stops executing in Redshift, so I cannot finish my load from S3 to Redshift.
I also tried adding the statement_timeout statement below after the BEGIN statement and before the COPY command. It didn't help.
SET statement_timeout to 18000000;
Can someone suggest how I can solve this issue?
Answer 1:
The AWS documentation isn't explicit about what happens when the timeout occurs, but I think it's safe to say that the function transitions into the "Shutdown" phase, at which point the runtime container is forcibly terminated by the environment.
What this means is that the socket used by the database connection will be closed, and the Redshift process listening on that socket will receive an end-of-file -- a client disconnect. The normal behavior of any database in this situation is to terminate any outstanding queries and roll back their transactions.
The reason I gave that description is to make clear that you can't extend the life of a query beyond the life of the Lambda that initiates it. If you want to stick with a database connection library, you will need to use a service that doesn't time out: AWS Batch and ECS are two options.
But there's a better option: the Redshift Data API, which is supported by Boto3.
This API operates asynchronously: you submit a query to Redshift and get back a token that can be used to check on the query's status. You can also instruct Redshift to send a message to Amazon EventBridge when the query completes or fails (so you can create another Lambda to take appropriate action).
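As a rough sketch of that pattern, something like the following should work; the cluster identifier, database, Secrets Manager ARN, table, S3 path, and IAM role below are all placeholders you would replace with your own values:

import boto3

# The Data API runs the statement server-side, so the Lambda can return
# as soon as the statement is submitted -- no open socket to keep alive.
client = boto3.client('redshift-data')

def lambda_handler(event, context):
    response = client.execute_statement(
        ClusterIdentifier='my-cluster',          # placeholder cluster name
        Database='my_database',                  # placeholder database
        SecretArn='arn:aws:secretsmanager:...',  # credentials from Secrets Manager
        Sql="COPY my_table FROM 's3://my-bucket/path/' "
            "iam_role 'arn:aws:iam::123456789012:role/my-role' FORMAT AS PARQUET;",
        WithEvent=True,  # publish a completion event to EventBridge
    )
    # The statement Id is the token you poll with, e.g.:
    # client.describe_statement(Id=response['Id'])['Status']
    return response['Id']

The WithEvent flag is what lets a second Lambda, triggered from EventBridge, pick up the result and take whatever follow-up action you need.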
Source: https://stackoverflow.com/questions/65038660/how-to-make-the-copy-command-continue-its-run-in-redshift-even-after-the-lambda