Running scheduled Spark job

Asked by 庸人自扰 on 2020-12-08 02:07

I have a Spark job which reads a source table, does a number of map / flatten / reduce operations and then stores the results into a separate table we use for reporting. Currently the job is run manually with spark-submit. What is the best way to schedule it to run on a regular basis (e.g. nightly)?

6 Answers
  •  天涯浪人
    2020-12-08 03:00

    You can use a crontab, but as you start having Spark jobs that depend on other Spark jobs, I would recommend Pinball for coordination: https://github.com/pinterest/pinball
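    Before reaching for a full scheduler, simple sequential dependencies can also be expressed in plain bash by chaining the wrapper script shown below. This is only a sketch; the script path and class names are hypothetical placeholders, not from the original post:

    # Run jobs in order; the second only starts if the first succeeds.
    # com.example.ExtractJob and com.example.ReportingJob are placeholders.
    /locm/spark_jobs/run_spark_job.sh com.example.ExtractJob yarn-client \
      && /locm/spark_jobs/run_spark_job.sh com.example.ReportingJob yarn-client \
      || echo "$(date) pipeline failed" >> /locm/spark_jobs/logs/pipeline.log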

    To get a simple crontab working, I would create a wrapper script such as the following:

    #!/bin/bash
    # Wrapper around spark-submit so jobs can be launched from cron with a
    # predictable environment.
    cd /locm/spark_jobs || exit 1
    
    export SPARK_HOME=/usr/hdp/2.2.0.0-2041/spark
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export HADOOP_USER_NAME=hdfs
    export HADOOP_GROUP=hdfs
    
    #export SPARK_CLASSPATH=$SPARK_CLASSPATH:/locm/spark_jobs/configs/*
    
    CLASS="$1"       # fully qualified class name of the job
    MASTER="$2"      # Spark master, e.g. yarn-client
    ARGS="$3"        # extra spark-submit options (left unquoted below so they word-split)
    CLASS_ARGS="$4"  # arguments passed through to the job itself
    echo "Running $CLASS with master: $MASTER, args: $ARGS, class args: $CLASS_ARGS"
    
    "$SPARK_HOME/bin/spark-submit" --class "$CLASS" --master "$MASTER" \
      --num-executors 4 --executor-cores 4 $ARGS \
      spark-jobs-assembly*.jar $CLASS_ARGS \
      >> "/locm/spark_jobs/logs/$CLASS.log" 2>&1
    
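    For a quick manual test before wiring it into cron, the script can be invoked directly. Assuming it was saved as /locm/spark_jobs/run_spark_job.sh (the script name, class name, and options here are illustrative, not from the post):

    # Manual test run; com.example.ReportingJob and the options are placeholders.
    /locm/spark_jobs/run_spark_job.sh \
      com.example.ReportingJob \
      yarn-client \
      "--driver-memory 2g" \
      "2020-12-07"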

    Then create the crontab entry:

    1. Run crontab -e
    2. Insert a line such as: 30 1 * * * /PATH/TO/SCRIPT.sh $CLASS "yarn-client"
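    Note that cron runs with a minimal environment and does not expand variables like $CLASS from your shell session, so write the class name out literally in the entry. A concrete example (the script path and class name are placeholders):

    # Runs every night at 01:30. cron has no $CLASS variable, so the fully
    # qualified class name is spelled out; com.example.ReportingJob is a placeholder.
    30 1 * * * /locm/spark_jobs/run_spark_job.sh com.example.ReportingJob "yarn-client"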
