Running scheduled Spark job

Backend · Unresolved · 6 answers · 598 views
庸人自扰
庸人自扰 2020-12-08 02:07

I have a Spark job which reads a source table, does a number of map / flatten / reduce operations and then stores the results into a separate table we use for reporting. Cur

6 Answers
  •  余生分开走
    2020-12-08 02:53

    Crontab is good enough only if you don't care about high availability, since it will run on a single machine that can fail.
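    For the simple crontab route, an entry could look like the sketch below. The paths, schedule, class name, and master URL are all placeholders, not taken from the question:

```shell
# m h dom mon dow  command -- run the reporting job nightly at 02:00
# (hypothetical paths and class name; adjust --master for your cluster)
0 2 * * * /opt/spark/bin/spark-submit --master local[4] \
    --class com.example.ReportingJob /opt/jobs/reporting.jar \
    >> /var/log/reporting-job.log 2>&1
```

    Redirecting stdout and stderr to a log file matters here, since cron otherwise discards (or mails) the job's output.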

    The fact that you run in stand-alone mode indicates that you don't have Hadoop or Mesos installed, which come with tools that make this task more reliable.

    An alternative to crontab (though at the moment it suffers from high-availability issues as well) is Airbnb's Airflow. It was built exactly for such use cases (among others); see here: http://airflow.incubator.apache.org/scheduler.html.

    Mesos users can try Chronos, which is a cron-like scheduler for clusters: https://github.com/mesos/chronos.

    There is also Oozie, which comes from the Hadoop world: http://blog.cloudera.com/blog/2013/01/how-to-schedule-recurring-hadoop-jobs-with-apache-oozie/.

    If this is mission critical, you can even implement it yourself using Consul/ZooKeeper or another tool that provides leader election: run your process on multiple machines, have them compete for leadership, and make sure the leader submits the job to Spark.
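    The compete-on-leadership idea can be sketched as below. This is a toy in-process stand-in for the coordination service (in production you would use, e.g., a ZooKeeper or Consul client's election recipe instead); the class and function names are illustrative, not from any library:

```python
import threading

class LeaderElector:
    """Toy stand-in for Consul/ZooKeeper leader election.
    A lock plus a flag plays the role of the coordination service:
    the first node to claim leadership wins, atomically."""
    def __init__(self):
        self._lock = threading.Lock()
        self._leader = None

    def try_acquire(self, node_id):
        # Atomic compare-and-set on the leader slot.
        with self._lock:
            if self._leader is None:
                self._leader = node_id
                return True
            return self._leader == node_id

def run_scheduler(elector, node_id, submitted):
    # Every node wakes up on schedule, but only the leader acts.
    if elector.try_acquire(node_id):
        submitted.append(node_id)  # stand-in for calling spark-submit

elector = LeaderElector()
submitted = []
threads = [threading.Thread(target=run_scheduler,
                            args=(elector, f"node-{i}", submitted))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Only one of the three competing nodes ends up submitting the job.
```

    The point of the pattern is that any single machine can die without losing the schedule, because the surviving nodes will win the next election.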

    You can use Spark Job Server to make job submission more elegant: https://github.com/spark-jobserver/spark-jobserver
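    With Spark Job Server, submission becomes an HTTP POST to its /jobs endpoint rather than a spark-submit invocation. A minimal sketch of building that request (the host, app name, and class are assumed placeholders; the appName/classPath query parameters follow the spark-jobserver README):

```python
from urllib.parse import urlencode

def jobserver_submit_url(base, app_name, class_path, context=None):
    """Build the POST URL for spark-jobserver's /jobs endpoint.
    base, app_name and class_path are illustrative values here."""
    params = {"appName": app_name, "classPath": class_path}
    if context:
        # Optionally target a pre-created, long-lived Spark context.
        params["context"] = context
    return f"{base}/jobs?{urlencode(params)}"

url = jobserver_submit_url("http://jobserver:8090", "reporting-etl",
                           "com.example.ReportingJob")
```

    You would then POST any job configuration as the request body (e.g. with requests or curl); the scheduler, whichever one you pick from the options above, only needs to fire this HTTP call.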
