RDD CountApproximate taking far longer than requested timeout


In an attempt to reduce the time spent on gathering counts of DataFrame rows, RDD.countApproximate() is being invoked. It has the f

1 Answer
  •  盖世英雄少女心
    2021-01-12 23:41

    Please note the return type in the countApprox definition. It's a partial result.

    def countApprox(
        timeout: Long,
        confidence: Double = 0.95): PartialResult[BoundedDouble] = withScope {
    

    Now, pay attention to how it is being used:

    val waitSecs = 60
    val cnt = inputDf.rdd.countApprox(waitSecs * 1000, 0.10).getFinalValue.mean
    

    According to the Spark Scala docs, getFinalValue is a blocking method, which means it will wait for the complete operation to finish.

    In contrast, initialValue can be fetched within the specified timeout, so the following snippet will not block further operations after the timeout:

    val waitSecs = 60
    val cnt = inputDf.rdd.countApprox(waitSecs * 1000, 0.10).initialValue.mean
    
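    As a side note, the BoundedDouble returned by initialValue carries more than the point estimate. Here is a minimal sketch (not part of the original answer, assuming Spark's documented BoundedDouble fields mean, low, high and confidence) that reads the estimate together with its confidence interval:

    val waitSecs = 60
    // Read the whole BoundedDouble instead of just .mean
    val bounded = inputDf.rdd.countApprox(waitSecs * 1000, 0.10).initialValue
    println(s"estimate=${bounded.mean}, interval=[${bounded.low}, ${bounded.high}], confidence=${bounded.confidence}")
    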

    Please note the downside of using countApprox(timeout, confidence).initialValue: even after the value is returned, Spark keeps counting until it reaches the final count you would have obtained with getFinalValue, and it holds the resources until that operation is complete.
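    If holding the resources is a concern, one option is to tag the approximate count with a job group and cancel it once the estimate has been read. This is only a sketch, not part of the original answer; it assumes you never need the exact count, that cancelling the running job is acceptable, and the group id "approx-count" is an arbitrary illustrative name:

    val sc = spark.sparkContext
    sc.setJobGroup("approx-count", "approximate row count", interruptOnCancel = true)
    val estimate =
      try {
        inputDf.rdd.countApprox(waitSecs * 1000, 0.10).initialValue.mean
      } finally {
        // Stop the still-running count job so it no longer holds cluster resources
        sc.cancelJobGroup("approx-count")
      }
    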

    The point of using this API is to avoid getting blocked at the count operation.
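    If you do want the exact count eventually but without blocking the driver, PartialResult also exposes an onComplete callback. A hedged sketch (again not from the original answer, assuming the callback-based API fits your use case):

    val partial = inputDf.rdd.countApprox(waitSecs * 1000, 0.10)
    // Quick estimate, available after at most waitSecs seconds
    val quickEstimate = partial.initialValue.mean
    // Runs once the exact count is ready; the driver is not blocked in the meantime
    partial.onComplete(bounded => println(s"exact count: ${bounded.mean}"))
    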

    Reference: https://mail-archives.apache.org/mod_mbox/spark-user/201505.mbox/%3C747872034.1520543.1431544429083.JavaMail.yahoo@mail.yahoo.com%3E

    Now let's validate our assumption of non-blocking operation in spark2-shell. Let's create a random DataFrame and perform count, countApprox with getFinalValue, and countApprox with initialValue:

    scala> import org.apache.spark.sql.Row
    scala> import org.apache.spark.sql.types.{StructType, StructField, StringType}
    scala> import scala.util.Random
    
    scala> val schema = StructType((0 to 10).map(n => StructField(s"column_$n", StringType)))
    schema: org.apache.spark.sql.types.StructType = StructType(StructField(column_0,StringType,true), StructField(column_1,StringType,true), StructField(column_2,StringType,true), StructField(column_3,StringType,true), StructField(column_4,StringType,true), StructField(column_5,StringType,true), StructField(column_6,StringType,true), StructField(column_7,StringType,true), StructField(column_8,StringType,true), StructField(column_9,StringType,true), StructField(column_10,StringType,true))
    
    scala> val rows = spark.sparkContext.parallelize(Seq[Row](), 100).mapPartitions { _ => { Range(0, 100000).map(m => Row(schema.map(_ => Random.alphanumeric.filter(_.isLower).head.toString).toList: _*)).iterator } }
    rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[1] at mapPartitions at :32
    
    scala> val inputDf = spark.sqlContext.createDataFrame(rows, schema)
    inputDf: org.apache.spark.sql.DataFrame = [column_0: string, column_1: string ... 9 more fields]
    
    //Please note that cnt will be displayed only when all tasks are completed
    scala> val cnt = inputDf.rdd.count
    cnt: Long = 10000000
    
    scala> val waitSecs = 60
    waitSecs: Int = 60
    
    //cntApprxFinal will be displayed only when all tasks are completed.
    scala> val cntApprxFinal = inputDf.rdd.countApprox(waitSecs * 1000, 0.10).getFinalValue.mean
    [Stage 1:======================================================> (98 + 2) / 100]cntApprxFinal: Double = 1.0E7
    
    scala> val waitSecs = 60
    waitSecs: Int = 60
    
    //Please note that cntApprxInitial, in this case, will be displayed right after the timeout duration. Here 80 tasks were completed within the timeout, and the value of the variable was displayed at that point. Even after displaying the value, Spark continued with all the remaining tasks.
    scala> val cntApprxInitial = inputDf.rdd.countApprox(waitSecs * 1000, 0.10).initialValue.mean
    [Stage 2:============================================>           (80 + 4) / 100]cntApprxInitial: Double = 1.0E7
    
    [Stage 2:=======================================================>(99 + 1) / 100]
    

    Let's have a look at the Spark UI and spark-shell; all 3 operations took the same time:

    cntApprxInitial is available before completion of all tasks.

    Hope this helps!
