Spark count vs take and length

你的背包 2020-12-06 15:31

I'm using com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 when running Zeppelin notebooks and don't understand the difference between two operations in sp

1 answer
  • 2020-12-06 15:43

    What you see is the difference between the implementations of Limit (a transformation-like operation) and CollectLimit (an action-like operation). However, the difference in timings is highly misleading, and not something you can expect in the general case.

    First, let's create an MCVE:

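    // Already in scope in spark-shell and Zeppelin; needed elsewhere for .as[String] and .map.
    import spark.implicits._

    // Use tiny input splits so that even a small file is read as several partitions.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 500)

    val ds = spark.read
      .text("README.md")
      .as[String]
      .map { x =>
        Thread.sleep(1000) // simulate an expensive per-record computation
        x
      }

    // limit is lazy: nothing is executed until an action is called.
    val dsLimit4 = ds.limit(4)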
    

    Make sure we start with a clean slate:

    spark.sparkContext.statusTracker.getJobIdsForGroup(null).isEmpty
    
    Boolean = true
    

    Invoke count:

    dsLimit4.count()
    

    and take a look at the execution plan (from the Spark UI):

    == Parsed Logical Plan ==
    Aggregate [count(1) AS count#12L]
    +- GlobalLimit 4
       +- LocalLimit 4
          +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
             +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
                +- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
                   +- Relation[value#0] text
    
    == Analyzed Logical Plan ==
    count: bigint
    Aggregate [count(1) AS count#12L]
    +- GlobalLimit 4
       +- LocalLimit 4
          +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
             +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
                +- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
                   +- Relation[value#0] text
    
    == Optimized Logical Plan ==
    Aggregate [count(1) AS count#12L]
    +- GlobalLimit 4
       +- LocalLimit 4
          +- Project
             +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
                +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
                   +- DeserializeToObject value#0.toString, obj#5: java.lang.String
                      +- Relation[value#0] text
    
    == Physical Plan ==
    *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#12L])
    +- *(2) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#15L])
       +- *(2) GlobalLimit 4
          +- Exchange SinglePartition
             +- *(1) LocalLimit 4
                +- *(1) Project
                   +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
                      +- *(1) MapElements <function1>, obj#6: java.lang.String
                         +- *(1) DeserializeToObject value#0.toString, obj#5: java.lang.String
                            +- *(1) FileScan text [value#0] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/path/to/README.md], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
    

    The core component is

    +- *(2) GlobalLimit 4
       +- Exchange SinglePartition
          +- *(1) LocalLimit 4
    

    which indicates that we can expect a wide operation with multiple stages. We can see a single job

    spark.sparkContext.statusTracker.getJobIdsForGroup(null)
    
    Array[Int] = Array(0)
    

    with two stages

    spark.sparkContext.statusTracker.getJobInfo(0).get.stageIds
    
    Array[Int] = Array(0, 1)
    

    with eight

    spark.sparkContext.statusTracker.getStageInfo(0).get.numTasks
    
    Int = 8
    

    and one

    spark.sparkContext.statusTracker.getStageInfo(1).get.numTasks
    
    Int = 1
    

    task respectively.
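
    The lookups above can be folded into one helper. This is only a sketch: jobSummary is a hypothetical name (not part of Spark), it relies solely on the statusTracker calls already shown, and it assumes a live SparkSession such as the spark value in spark-shell or Zeppelin. The tracker may no longer hold information about older jobs or stages, in which case they are simply skipped.

    ```scala
    import org.apache.spark.sql.SparkSession

    // Hypothetical helper: maps each known job ID to its (stage ID, task count) pairs.
    def jobSummary(spark: SparkSession): Map[Int, Seq[(Int, Int)]] = {
      val tracker = spark.sparkContext.statusTracker
      tracker.getJobIdsForGroup(null).toSeq.map { jobId =>
        val stages = for {
          job     <- tracker.getJobInfo(jobId).toSeq     // None if the job info is gone
          stageId <- job.stageIds.toSeq
          stage   <- tracker.getStageInfo(stageId).toSeq // None if the stage info is gone
        } yield (stageId, stage.numTasks)
        jobId -> stages
      }.toMap
    }
    ```

    Calling jobSummary(spark) after each action gives the same job, stage, and task numbers as the individual calls, in one place.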

    Now let's compare it to

    dsLimit4.take(300).size
    

    which generates the following plans:

    == Parsed Logical Plan ==
    GlobalLimit 300
    +- LocalLimit 300
       +- GlobalLimit 4
          +- LocalLimit 4
             +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
                +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
                   +- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
                      +- Relation[value#0] text
    
    == Analyzed Logical Plan ==
    value: string
    GlobalLimit 300
    +- LocalLimit 300
       +- GlobalLimit 4
          +- LocalLimit 4
             +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
                +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
                   +- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
                      +- Relation[value#0] text
    
    == Optimized Logical Plan ==
    GlobalLimit 4
    +- LocalLimit 4
       +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
          +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
             +- DeserializeToObject value#0.toString, obj#5: java.lang.String
                +- Relation[value#0] text
    
    == Physical Plan ==
    CollectLimit 4
    +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
       +- *(1) MapElements <function1>, obj#6: java.lang.String
          +- *(1) DeserializeToObject value#0.toString, obj#5: java.lang.String
             +- *(1) FileScan text [value#0] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/path/to/README.md], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
    

    While both global and local limits still occur, there is no exchange in the middle. Therefore we can expect a single-stage operation. Note that the planner narrowed the limit down to the more restrictive value (4 rather than 300).

    As expected we see a single new job:

    spark.sparkContext.statusTracker.getJobIdsForGroup(null)
    
    Array[Int] = Array(1, 0)
    

    which generated only one stage:

    spark.sparkContext.statusTracker.getJobInfo(1).get.stageIds
    
    Array[Int] = Array(2)
    

    with only one task

    spark.sparkContext.statusTracker.getStageInfo(2).get.numTasks
    
    Int = 1
    

    What does it mean for us?

    • In the count case, Spark used a wide transformation: it applied LocalLimit on each partition and shuffled the partial results to perform GlobalLimit.
    • In the take case, Spark used a narrow transformation and evaluated LocalLimit only on the first partition.
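
    The two paths can be mimicked without Spark. The following is a toy model, not Spark's implementation: a dataset is just a sequence of partitions, and LimitPathsSketch, limitThenCount, and takeFromFirstPartition are names made up for this illustration.

    ```scala
    // Toy model: a dataset is a sequence of partitions; no Spark involved.
    object LimitPathsSketch {
      // limit(n).count(): LocalLimit runs on every partition, the partial results
      // are moved to a single partition, and GlobalLimit plus the count run there.
      // Returns (count, number of tasks).
      def limitThenCount[A](partitions: Seq[Seq[A]], n: Int): (Int, Int) = {
        val local = partitions.map(_.take(n)) // one task per partition
        val global = local.flatten.take(n)    // one more task after the exchange
        (global.size, partitions.size + 1)
      }

      // take(n) under the assumption that the first partition already holds
      // at least n rows: a single task is enough.
      // Returns (rows, number of tasks).
      def takeFromFirstPartition[A](partitions: Seq[Seq[A]], n: Int): (Seq[A], Int) = {
        val rows = partitions.headOption.toSeq.flatMap(_.take(n))
        (rows, 1)
      }
    }
    ```

    With eight partitions of 14 rows each (made-up sizes), limitThenCount(parts, 4) reports 9 tasks, in line with the 8 + 1 tasks observed for count, while takeFromFirstPartition(parts, 4) needs only one.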

    Obviously, the latter approach won't work when the number of values in the first partition is lower than the requested limit.

    val dsLimit105 = ds.limit(105) // There are 105 lines
    

    In such a case, count will use exactly the same logic as before (I encourage you to confirm that empirically), but take will take a rather different path. So far we have triggered only two jobs:

    spark.sparkContext.statusTracker.getJobIdsForGroup(null)
    
    Array[Int] = Array(1, 0)
    

    Now if we execute

    dsLimit105.take(300).size
    

    you'll see that it required 3 more jobs:

    spark.sparkContext.statusTracker.getJobIdsForGroup(null)
    
    Array[Int] = Array(4, 3, 2, 1, 0)
    

    So what's going on here? As noted before, evaluating a single partition is not enough to satisfy the limit in the general case. In such a case Spark iteratively evaluates LocalLimit on partitions until GlobalLimit is satisfied, increasing the number of partitions taken in each iteration.
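
    The iteration can be sketched in plain Scala. This is a simplified model under stated assumptions, not Spark's actual code: IterativeTakeSketch is a hypothetical name, each round stands for one job, and the growth rule (one partition first, then up to scaleUp times the partitions scanned so far, with scaleUp = 4) is an assumption of this sketch. Spark's real estimate of how many partitions to try next is more elaborate.

    ```scala
    // Simplified model of an iterative take: each round ("job") scans a batch of
    // partitions, and the batch grows until the limit is met or partitions run out.
    object IterativeTakeSketch {
      // Returns (rows collected, number of simulated jobs).
      def take[A](partitions: Seq[Seq[A]], n: Int, scaleUp: Int = 4): (Seq[A], Int) = {
        var collected = Vector.empty[A]
        var partsScanned = 0
        var jobs = 0
        while (collected.size < n && partsScanned < partitions.size) {
          // The first round looks at one partition; later rounds grow geometrically.
          val numToTry = if (partsScanned == 0) 1 else partsScanned * scaleUp
          val batch = partitions.slice(partsScanned, partsScanned + numToTry)
          // LocalLimit: no partition contributes more rows than are still missing.
          val left = n - collected.size
          collected = collected ++ batch.flatMap(_.take(left))
          partsScanned += batch.size
          jobs += 1
        }
        // GlobalLimit: trim whatever the last batch over-delivered.
        (collected.take(n), jobs)
      }
    }
    ```

    With eight partitions of 14 rows each (illustrative sizes only), take(parts, 4) finishes in one round, while take(parts, 105) needs three rounds that scan 1, 4, and 3 partitions. That is consistent with the three extra jobs seen above, although the real partition sizes may differ.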

    Such a strategy can have significant performance implications. Starting Spark jobs alone is not cheap, and when the upstream object is the result of a wide transformation, things can get quite ugly (in the best-case scenario you can read the shuffle files, but if these are lost for some reason, Spark might be forced to re-execute all the dependencies).

    To summarize:

    • take is an action, and can short-circuit in specific cases where the upstream process is narrow and the LocalLimits can satisfy GlobalLimit using the first few partitions.
    • limit is a transformation, and always evaluates all LocalLimits, as there is no iterative escape hatch.

    While one can behave better than the other in specific cases, they are not interchangeable, and neither guarantees better performance in general.
