What is the concept of application, job, stage and task in spark?

你的背包 2020-12-13 00:08

Is my understanding right?

  1. Application: one spark submit.

  2. job: once lazy evaluation is triggered, there is a job.

  3. stage: It…

3 Answers
  • 2020-12-13 00:45

    The main function is the application.

    When you invoke an action on an RDD, a "job" is created. Jobs are work submitted to Spark.

    Jobs are divided into "stages" at shuffle boundaries: a wide (shuffle) dependency ends one stage and begins the next.

    Each stage is further divided into tasks based on the number of partitions of the RDD, so tasks are the smallest units of work in Spark.
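    The stage-to-task relationship above can be sketched in plain Python. This is a toy model, not Spark's actual scheduler: a "stage" simply runs one function call per partition, so the task count equals the partition count.

```python
# Toy illustration (NOT Spark internals): a stage runs one task per
# partition, so #tasks == #partitions.
from concurrent.futures import ThreadPoolExecutor

def run_stage(partitions, func):
    """Run `func` once per partition; each call models one task."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, partitions))

# A toy "RDD" with 4 partitions of integers.
partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]

# One task per partition computes a partial sum (a map-side stage).
partial_sums = run_stage(partitions, sum)
print(partial_sums)        # [3, 7, 11, 15] -- 4 partitions, 4 tasks
print(sum(partial_sums))   # 36 -- driver combines the partial results
```

    Repartitioning the same data into 2 partitions would yield 2 tasks for the same stage, which is why partition count directly controls parallelism.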

  • 2020-12-13 00:45

    A very nice definition from the Cloudera documentation; here is the key point.

    In MapReduce, the highest-level unit of computation is a job. A job loads data, applies a map function, shuffles it, applies a reduce function, and writes data back out to persistent storage. But in Spark, the highest-level unit of computation is an application. A Spark application can be used for a single batch job, an interactive session with multiple jobs, or a long-lived server continually satisfying requests. A Spark job can consist of more than just a single map and reduce.
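    The "one application, many jobs" idea can be sketched with a toy model (hypothetical names, not Spark's API): the driver submits one job per action, and all of them belong to the same long-lived application.

```python
# Toy sketch (not Spark's API): one application accumulates many jobs,
# one per action the driver invokes.
class ToyApplication:
    def __init__(self, name):
        self.name = name
        self.jobs = []          # completed jobs, in submission order

    def run_job(self, description, data, action):
        """Each action submitted by the driver counts as one job."""
        result = action(data)
        self.jobs.append(description)
        return result

app = ToyApplication("interactive-session")
data = list(range(10))
total = app.run_job("sum action", data, sum)     # job 1
count = app.run_job("count action", data, len)   # job 2
print(total, count, len(app.jobs))               # 45 10 2
```

    This mirrors an interactive Spark shell: the session is one application, and every `collect()`, `count()`, etc. you type launches another job inside it.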

  • 2020-12-13 01:03

    From 7-steps-for-a-developer-to-learn-apache-spark

    The anatomy of a Spark application usually comprises Spark operations, which can be either transformations or actions on your data sets using Spark’s RDDs, DataFrames or Datasets APIs. For example, in your Spark app, if you invoke an action, such as collect() or take() on your DataFrame or Dataset, the action will create a job. A job will then be decomposed into single or multiple stages; stages are further divided into individual tasks; and tasks are units of execution that the Spark driver’s scheduler ships to Spark Executors on the Spark worker nodes to execute in your cluster. Often multiple tasks will run in parallel on the same executor, each processing its unit of partitioned dataset in its memory.
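    The whole hierarchy described above (action → job → stages → tasks) can be sketched as a toy planner. This is only an illustration of the bookkeeping, not Spark's DAGScheduler: a job's lineage of operations is cut into stages at each shuffle, and every stage then yields one task per partition.

```python
# Toy model (not Spark's DAGScheduler): split a job's lineage into
# stages at shuffle boundaries, then emit one task per partition
# for each stage.
def plan_job(lineage, num_partitions):
    """lineage: list of ("narrow" | "shuffle", op_name) tuples."""
    stages, current = [], []
    for kind, op in lineage:
        current.append(op)
        if kind == "shuffle":      # a shuffle ends the current stage
            stages.append(current)
            current = []
    if current:                    # trailing narrow ops form the last stage
        stages.append(current)
    # every stage runs one task per partition
    tasks = [(stage_id, p)
             for stage_id in range(len(stages))
             for p in range(num_partitions)]
    return stages, tasks

lineage = [("narrow", "map"), ("shuffle", "reduceByKey"), ("narrow", "filter")]
stages, tasks = plan_job(lineage, num_partitions=3)
print(len(stages))   # 2 stages: [map, reduceByKey] and [filter]
print(len(tasks))    # 6 tasks: 2 stages x 3 partitions
```

    In this sketch the `collect()`-style action is implicit: calling `plan_job` at all stands in for the action that triggers the job.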
