Question
I have a series of ~10 queries that need to run automatically every hour in Redshift (ideally reporting success/failure).
Most of the queries are aggregations over my tables.
I have tried using AWS Lambda with CloudWatch Events, but Lambda functions only survive for 5 minutes max and my queries can take up to 25 minutes.
Answer 1:
It's kind of strange that AWS doesn't provide a simple distributed cron-style service. It would be useful for so many things. There is SWF, but the timing/scheduling aspect is left up to the user. You could use Lambda/CloudWatch to trigger SWF events, but that's a lot of overhead just to get reasonable cron-like behavior.
As the comment says, the easiest way would be to run a small instance and host the cron jobs there. Use an Auto Scaling group of 1 for some reliability. A similar but more complicated approach is to use Elastic Beanstalk.
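As a rough illustration of the cron-on-a-small-instance approach, here is a minimal Python sketch of a script that runs the queries one after another and records success or failure for each. It assumes a DB-API 2.0 driver such as psycopg2 is installed; the names `run_queries`, `QueryResult`, the `REDSHIFT_*` environment variables, and the example SQL and table names are all hypothetical, not something from the question.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

log = logging.getLogger("hourly_redshift")


@dataclass
class QueryResult:
    name: str
    ok: bool
    seconds: float
    error: Optional[str] = None


def run_queries(connect: Callable[[], object], queries: Dict[str, str]) -> List[QueryResult]:
    """Run each named SQL statement in order and record success or failure."""
    results: List[QueryResult] = []
    conn = connect()
    try:
        for name, sql in queries.items():
            started = time.monotonic()
            try:
                cur = conn.cursor()
                cur.execute(sql)
                conn.commit()
                results.append(QueryResult(name, True, time.monotonic() - started))
                log.info("%s succeeded", name)
            except Exception as exc:
                # Roll back so the connection stays usable for the next query.
                conn.rollback()
                results.append(QueryResult(name, False, time.monotonic() - started, str(exc)))
                log.error("%s failed: %s", name, exc)
    finally:
        conn.close()
    return results


def main() -> int:
    # Assumption: psycopg2 is installed and the REDSHIFT_* variables are set.
    import os
    import psycopg2

    # Hypothetical aggregation; replace with the real ~10 statements.
    queries = {
        "hourly_totals": (
            "INSERT INTO hourly_totals "
            "SELECT date_trunc('hour', created_at), count(*) FROM events GROUP BY 1"
        ),
    }

    def connect():
        return psycopg2.connect(
            host=os.environ["REDSHIFT_HOST"],
            port=5439,
            dbname=os.environ["REDSHIFT_DB"],
            user=os.environ["REDSHIFT_USER"],
            password=os.environ["REDSHIFT_PASSWORD"],
        )

    logging.basicConfig(level=logging.INFO)
    results = run_queries(connect, queries)
    # Non-zero exit code if anything failed, so cron or a wrapper can notice.
    return 0 if all(r.ok for r in results) else 1
```

To use it, end the script with `sys.exit(main())` and add a crontab entry along the lines of `0 * * * * /usr/bin/python3 /opt/jobs/hourly_redshift.py` (the path is a placeholder). Because the script runs on an ordinary instance, a 25-minute query is not a problem.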
If you really want redundancy, reliability, visibility, etc., it might be worth looking at a third-party solution like Airflow. There are many others, depending on your language of preference.
Here's a similar question with more info.
Answer 2:
I had the same problem in the past.
You can use R or Python for that.
I used R: you can install the RPostgreSQL package and connect to your Redshift cluster with it. Example:
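# Redshift speaks the PostgreSQL wire protocol, so the PostgreSQL driver works
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
conn <- dbConnect(drv, host='mm-stats-1.ctea4hmr4vlw.us-east-1.redshift.amazonaws.com', port='5439', dbname='stats', user='xxx', password='yyy')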
You can then build the report with Markdown and schedule it with a crontab task.
I also used the mailR package to send the report to other users.
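Since Python is mentioned as an alternative to R, here is a minimal sketch of the "send the report to other users" step using only the Python standard library, as a rough counterpart to mailR. The function names `build_report` and `send_report`, the `(name, ok, detail)` result tuples, and the SMTP host are assumptions made for illustration.

```python
import smtplib
from email.message import EmailMessage
from typing import Iterable, List, Tuple


def build_report(
    results: Iterable[Tuple[str, bool, str]],
    sender: str,
    recipients: List[str],
) -> EmailMessage:
    """Turn (query name, succeeded, detail) tuples into a plain-text email."""
    rows = list(results)
    failed = [name for name, ok, _ in rows if not ok]
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = (
        f"Redshift hourly run: {len(rows) - len(failed)} ok, {len(failed)} failed"
    )
    lines = [
        f"{'OK  ' if ok else 'FAIL'} {name} {detail}".rstrip()
        for name, ok, detail in rows
    ]
    msg.set_content("\n".join(lines) + "\n")
    return msg


def send_report(msg: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    # Assumption: an SMTP relay is reachable at host:port without authentication.
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

A cron-driven script could call `build_report(...)` after its queries finish and pass the result to `send_report(...)`.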
Answer 3:
Use AWS Lambda to run your script; you can schedule it. See https://docs.aws.amazon.com/lambda/latest/dg/with-scheduled-events.html
This uses CloudWatch Events behind the scenes. If you do it from the console, it will set things up for you.
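One possible way for a scheduled Lambda to cope with queries that outlast its time limit is to submit them and return immediately instead of waiting. The sketch below does this with the Redshift Data API via boto3 (`execute_statement` and `describe_statement`). That API is not part of this answer and is swapped in here as an assumption. The environment variable names, the `submit_queries` and `statement_status` helpers, and the `refresh_hourly_totals` procedure are all hypothetical.

```python
import os
from typing import Dict


def submit_queries(client, queries: Dict[str, str], cluster_id: str,
                   database: str, db_user: str) -> Dict[str, str]:
    """Submit each statement and return {name: statement id} without waiting."""
    ids = {}
    for name, sql in queries.items():
        resp = client.execute_statement(
            ClusterIdentifier=cluster_id,
            Database=database,
            DbUser=db_user,
            Sql=sql,
            StatementName=name,
        )
        ids[name] = resp["Id"]
    return ids


def statement_status(client, statement_id: str) -> str:
    """Return the Data API status string, e.g. STARTED, FINISHED or FAILED."""
    return client.describe_statement(Id=statement_id)["Status"]


def lambda_handler(event, context):
    # boto3 is imported lazily; it ships with the Lambda Python runtime.
    import boto3

    client = boto3.client("redshift-data")
    # Hypothetical stored procedure wrapping one of the hourly aggregations.
    queries = {"hourly_totals": "CALL refresh_hourly_totals()"}
    return submit_queries(
        client,
        queries,
        cluster_id=os.environ["REDSHIFT_CLUSTER"],
        database=os.environ["REDSHIFT_DB"],
        db_user=os.environ["REDSHIFT_USER"],
    )
```

Note that statements submitted this way run independently rather than strictly one after another, and success or failure has to be checked later, for example by a second scheduled invocation that calls `statement_status` on the returned ids.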
Answer 4:
You can use Data Pipeline to do that, although I think it's on an end-of-life path since they haven't released any new features to the service in a while and the GUI is pretty archaic and difficult to work with. The main benefit of using Data Pipeline over Lambda is that Lambda functions can only run for a maximum of 15 minutes, whereas Data Pipeline can track the status of the query until it's complete.
Source: https://stackoverflow.com/questions/42564910/how-to-execute-scheduled-sql-script-on-amazon-redshift