How to compute percentiles in Apache Spark

遥遥无期 2020-12-04 22:08

I have an RDD of integers (i.e. RDD[Int]) and I would like to compute the following eleven percentiles: [0th, 10th, 20th, ..., 90th, 100th].

10 Answers
  •  被撕碎了的回忆
    2020-12-04 22:59

    Here is a simple approach, assuming the data is in a DataFrame df with a numeric column "score":

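    // Quantile probabilities in [0, 1]; prepend 0.0 to also get the 0th percentile (the minimum).
    val percentiles = Array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
    val accuracy = 1000000
    // The third argument is the relative error; 0.0 would request exact quantiles.
    df.stat.approxQuantile("score", percentiles, 1.0 / accuracy)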
    

    Output:

    scala> df.stat.approxQuantile("score", percentiles, 1.0/accuracy)
    res88: Array[Double] = Array(0.011044141836464405, 0.02022990956902504, 0.0317261666059494, 0.04638145491480827, 0.06498630344867706, 0.0892181545495987, 0.12161539494991302, 0.16825592517852783, 0.24740923941135406, 0.9188197255134583)
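
The question starts from an RDD[Int] and also asks for the 0th percentile, while the snippet above assumes a DataFrame. Below is a minimal sketch of one way to bridge the two, assuming Spark SQL is on the classpath. The object name PercentilesSketch, the helper decilesOf, the column name "value", and the sample data 1 to 1000 are illustrative and not from the original answer. A relative error of 0.0 is used to request exact quantiles, which may be too expensive on large data.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PercentilesSketch {
  // Returns the 0th, 10th, ..., 100th percentiles of an RDD[Int].
  def decilesOf(spark: SparkSession, rdd: RDD[Int]): Array[Double] = {
    import spark.implicits._ // brings toDF into scope for RDDs

    // Wrap the integers in a single-column DataFrame (column name is arbitrary).
    val df = rdd.toDF("value")

    // 0.0, 0.1, ..., 1.0: probability 0.0 is the minimum, 1.0 the maximum.
    val probabilities = (0 to 10).map(_ / 10.0).toArray

    // Relative error 0.0 asks for exact quantiles instead of an approximation.
    df.stat.approxQuantile("value", probabilities, 0.0)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("percentiles-sketch")
      .getOrCreate()

    // Hypothetical input standing in for the RDD[Int] from the question.
    val rdd = spark.sparkContext.parallelize(1 to 1000)
    println(decilesOf(spark, rdd).mkString(", "))

    spark.stop()
  }
}
```

For large inputs, a small positive relative error (as in the answer above) is usually the cheaper choice.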
    

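    accuracy: approxQuantile has no accuracy parameter of its own; its third argument is the relative error, passed here as 1.0/accuracy. The wording follows the documentation of the SQL function percentile_approx, whose accuracy parameter (default: 10000) is a positive numeric literal that controls approximation accuracy at the cost of memory. A higher accuracy value yields a better approximation, and 1.0/accuracy is the relative error. Passing a relative error of 0 to approxQuantile computes exact quantiles, which can be very expensive.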
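
The relative error has a concrete meaning in terms of ranks. According to the approxQuantile documentation, for N rows, probability p and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is a minimal sketch of that arithmetic in plain Scala. The object name RankBounds, the helper rankBounds, and the clamping to [0, N] are illustrative additions, not part of Spark's API.

```scala
object RankBounds {
  // Lowest and highest rank the returned value may have among n rows,
  // for quantile probability p and the given relative error.
  def rankBounds(n: Long, p: Double, relativeError: Double): (Long, Long) = {
    val lo = math.floor((p - relativeError) * n).toLong
    val hi = math.ceil((p + relativeError) * n).toLong
    // Clamp to [0, n] so the window stays within the data (an addition for readability).
    (math.max(lo, 0L), math.min(hi, n))
  }

  def main(args: Array[String]): Unit = {
    // Median of 1,024,000 rows with relative error 1/1024.
    println(rankBounds(1024000L, 0.5, 1.0 / 1024))
  }
}
```

The window shrinks as the relative error decreases, which is why a larger accuracy value in the answer above gives tighter results at a higher memory cost.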
