How to execute scheduled SQL script on Amazon Redshift?

Question

I have a series of ~10 queries that need to run automatically every hour in Redshift (ideally reporting success or failure).

Most of the queries are aggregations over my tables.

I have tried using AWS Lambda with CloudWatch Events, but Lambda functions only survive for 5 minutes max and my queries can take up to 25 minutes.

Answer 1:

It's kind of strange that AWS doesn't provide a simple distributed cron-style service. It would be useful for so many things. There is SWF, but the timing/scheduling aspect is left up to the user. You could use Lambda/CloudWatch to trigger SWF events, but that's a lot of overhead just to get reasonable cron-like behavior.

As the comment says, the easiest way is to run a small instance and host the cron jobs there. Use an Auto Scaling group of size 1 for some reliability. A similar but more complicated approach is to use Elastic Beanstalk.
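As a rough illustration of the cron-on-an-instance approach, here is a minimal Python sketch of a script that runs the statements in order and reports which ones succeeded. The function names `run_queries` and `exit_code` are made up for this example, and the demo at the bottom uses an in-memory `sqlite3` database only so that it runs anywhere. Against Redshift you would presumably pass a connection from a PostgreSQL driver such as psycopg2 instead (an assumption, not something the answer specifies).

```python
import sqlite3


def run_queries(conn, queries):
    """Run each statement in order on a DB-API connection.

    Returns a list of (sql, ok, error) tuples and stops at the first
    failure, so later aggregations never run on incomplete data.
    """
    results = []
    for sql in queries:
        cur = conn.cursor()
        try:
            cur.execute(sql)
            conn.commit()
            results.append((sql, True, None))
        except Exception as exc:
            conn.rollback()
            results.append((sql, False, str(exc)))
            break
        finally:
            cur.close()
    return results


def exit_code(results):
    """Return 0 if every statement succeeded, 1 otherwise."""
    return 0 if all(ok for _, ok, _ in results) else 1


if __name__ == "__main__":
    # Stand-in database so the sketch is runnable as-is. For Redshift,
    # replace this with a connection from your PostgreSQL driver.
    conn = sqlite3.connect(":memory:")
    # Hypothetical placeholder statements, not the asker's real queries.
    results = run_queries(conn, [
        "CREATE TABLE hourly_totals (total INTEGER)",
        "INSERT INTO hourly_totals VALUES (42)",
    ])
    conn.close()
    for sql, ok, error in results:
        print("OK  " if ok else "FAIL", sql, error or "")
    print("exit code:", exit_code(results))
```

A crontab entry such as `0 * * * * python3 /path/to/run_hourly.py` (the path is hypothetical) would run it at the top of every hour. A real script would likely pass the result of `exit_code` to `sys.exit` so that monitoring can pick up failures.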

If you really want redundancy, reliability, visibility, etc., it might be worth looking at a third-party solution like Airflow. There are many others, depending on your language of preference.

Here's a similar question with more info.

Answer 2:

I had the same problem in the past.

You can use R or Python for that.

I used R. You can install the RPostgreSQL package and use it to connect to your Redshift cluster. Example:

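library(RPostgreSQL)  # provides the PostgreSQL DBI driver used below
drv <- dbDriver("PostgreSQL")
# Redshift listens on port 5439 by default; replace the placeholder credentials
conn <- dbConnect(drv, host='mm-stats-1.ctea4hmr4vlw.us-east-1.redshift.amazonaws.com', port='5439', dbname='stats', user='xxx', password='yyy')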

You can then build the report with Markdown and schedule it as a crontab task.

I also used the mailR package to send the report to other users.

Answer 3:

Use AWS Lambda to run your script; you can schedule it. See https://docs.aws.amazon.com/lambda/latest/dg/with-scheduled-events.html

This uses CloudWatch Events behind the scenes. If you set it up from the console, it will configure things for you.
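Because the queries in the question can outlast Lambda's execution limit, a scheduled function probably should not wait for them to finish. One possible workaround, which this answer does not mention, is to submit the statements through the Redshift Data API, which runs them asynchronously. The sketch below assumes boto3's `redshift-data` client. The environment variable names, the statement name, and the placeholder query are all hypothetical.

```python
import os

# Hypothetical placeholder; substitute the real hourly aggregations.
QUERIES = ["SELECT 1"]


def handler(event, context, client=None):
    """Lambda entry point: submit the batch and return without waiting."""
    if client is None:
        import boto3  # available in the Lambda Python runtime
        client = boto3.client("redshift-data")
    response = client.batch_execute_statement(
        # Environment variable names below are assumptions for this sketch.
        ClusterIdentifier=os.environ.get("CLUSTER_ID", "my-cluster"),
        Database=os.environ.get("DB_NAME", "stats"),
        DbUser=os.environ.get("DB_USER", "etl_user"),
        Sqls=QUERIES,                 # run serially, in order
        StatementName="hourly-aggregations",
        WithEvent=True,               # emit an EventBridge event on completion
    )
    return {"statement_id": response["Id"]}


def statement_status(statement_id, client=None):
    """Look up the batch later, e.g. from a second scheduled function."""
    if client is None:
        import boto3
        client = boto3.client("redshift-data")
    return client.describe_statement(Id=statement_id)["Status"]
```

The optional `client` argument exists only so the functions can be exercised with a stub; Lambda itself calls `handler(event, context)`. Success or failure can then be read from `statement_status` or from the completion event, rather than from the function's own run time.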

Answer 4:

You can use Data Pipeline to do that, although I think it's on an end-of-life path since they haven't released any new features to the service in a while and the GUI is pretty archaic and difficult to work with. The main benefit of using Data Pipeline over Lambda is that Lambda functions can only run for a maximum of 15 minutes, whereas Data Pipeline can track the status of the query until it's complete.

Source: https://stackoverflow.com/questions/42564910/how-to-execute-scheduled-sql-script-on-amazon-redshift

Tags

amazon-web-services

aws-lambda

amazon-redshift

etl