Running scheduled Spark job

Asked by 庸人自扰 on 2020-12-08 02:07

I have a Spark job which reads a source table, does a number of map / flatten / reduce operations and then stores the results into a separate table we use for reporting. Currently the job is run manually with spark-submit. What is the best way to schedule it to run on a regular basis (e.g. nightly)?

6 Answers
  •  天涯浪人
    2020-12-08 03:00

    You can use a crontab, but as you start having Spark jobs that depend on other Spark jobs, I would recommend Pinball for coordination: https://github.com/pinterest/pinball
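    Before reaching for a full scheduler, simple sequential dependencies can also be expressed in plain bash by chaining the wrapper script shown below. This is only a sketch; the script path and class names are hypothetical placeholders, not from the original post:

    # Run jobs in order; the second only starts if the first succeeds.
    # com.example.ExtractJob and com.example.ReportingJob are placeholders.
    /locm/spark_jobs/run_spark_job.sh com.example.ExtractJob yarn-client \
      && /locm/spark_jobs/run_spark_job.sh com.example.ReportingJob yarn-client \
      || echo "$(date) pipeline failed" >> /locm/spark_jobs/logs/pipeline.log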

    To get a simple crontab working, I would create a wrapper script such as the following:

    #!/bin/bash
    # Wrapper around spark-submit so jobs can be launched from cron with a
    # predictable environment.
    cd /locm/spark_jobs || exit 1
    
    export SPARK_HOME=/usr/hdp/2.2.0.0-2041/spark
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export HADOOP_USER_NAME=hdfs
    export HADOOP_GROUP=hdfs
    
    #export SPARK_CLASSPATH=$SPARK_CLASSPATH:/locm/spark_jobs/configs/*
    
    CLASS="$1"       # fully qualified class name of the job
    MASTER="$2"      # Spark master, e.g. yarn-client
    ARGS="$3"        # extra spark-submit options (left unquoted below so they word-split)
    CLASS_ARGS="$4"  # arguments passed through to the job itself
    echo "Running $CLASS with master: $MASTER, args: $ARGS, class args: $CLASS_ARGS"
    
    "$SPARK_HOME/bin/spark-submit" --class "$CLASS" --master "$MASTER" \
      --num-executors 4 --executor-cores 4 $ARGS \
      spark-jobs-assembly*.jar $CLASS_ARGS \
      >> "/locm/spark_jobs/logs/$CLASS.log" 2>&1
    
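    For a quick manual test before wiring it into cron, the script can be invoked directly. Assuming it was saved as /locm/spark_jobs/run_spark_job.sh (the script name, class name, and options here are illustrative, not from the post):

    # Manual test run; com.example.ReportingJob and the options are placeholders.
    /locm/spark_jobs/run_spark_job.sh \
      com.example.ReportingJob \
      yarn-client \
      "--driver-memory 2g" \
      "2020-12-07"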

    Then create the crontab entry:

    1. Run crontab -e
    2. Insert a line such as: 30 1 * * * /PATH/TO/SCRIPT.sh $CLASS "yarn-client"
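    Note that cron runs with a minimal environment and does not expand variables like $CLASS from your shell session, so write the class name out literally in the entry. A concrete example (the script path and class name are placeholders):

    # Runs every night at 01:30. cron has no $CLASS variable, so the fully
    # qualified class name is spelled out; com.example.ReportingJob is a placeholder.
    30 1 * * * /locm/spark_jobs/run_spark_job.sh com.example.ReportingJob "yarn-client"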
