I have series of ~10 queries to be executed every hour automatically in Redshift (maybe report success/failure).
Most queries are aggregation on my tables.
I have tried using AWS Lambda with CloudWatch Events, but Lambda functions only survive for 5 minutes max and my queries can take up to 25 minutes.
It's kind of strange that AWS doesn't provide a simple distributed cron style service. It would be useful for so many things. There is SWF, but the timing/scheduling aspect is left up to the user. You could use Lambda/Cloudwatch to trigger SWF events. That's a lot of overhead to get reasonable cron like activity.
Like the comment says the easiest way would be to run a small instance and host cron jobs there. Use an autoscale group of 1 for some reliability. A similar but more complicated approach is to use elastic beanstalk.
If you really want redundancy, reliability, visibility, etc. it might be worth looking at a third party solution like Airflow. There are many others depending on your language of preference.
Here's a similar question with more info.
i had the same problem in the past,
you can use R or Python for that.
i used R , you can install package RpostgreSQL and connecting to your Redshift attached example:
drv <- dbDriver("PostgreSQL")
conn <-dbConnect(drv,host='mm-stats-1.ctea4hmr4vlw.us-east-1.redshift.amazonaws.com',port='5439',dbname='stats',user='xxx',password='yyy')
and then you can build report with markdown and then scheduled it with crontab task.
also i used mailR package to send the report to other users
use aws lambda to run your script. you can schedule it. see https://docs.aws.amazon.com/lambda/latest/dg/with-scheduled-events.html
this uses CloudWatch events behind the scenes. If you do it from the console, it will set things up for you.
You can use Data Pipeline to do that, although I think it's on an end-of-life path since they haven't released any new features to the service in a while and the GUI is pretty archaic and difficult to work with. The main benefit of using Data Pipeline over Lambda is that Lambda functions can only run for a maximum of 15 minutes, whereas Data Pipeline can track the status of the query until it's complete.
来源:https://stackoverflow.com/questions/42564910/how-to-execute-scheduled-sql-script-on-amazon-redshift