finding min/max with pyspark in single pass over data


Question


I have an RDD containing a huge list of numbers (the lengths of lines read from a file), and I want to know how to get the min and max in a single pass over the data.

I know about the min and max functions, but calling both would require two passes.
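
For illustration, such an RDD might be built like this (the file name here is hypothetical):

lines = sc.textFile("input.txt")    # hypothetical input file
rdd = lines.map(len)                # RDD of line lengths, as described above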


Answer 1:


Try this:

>>> from pyspark.statcounter import StatCounter
>>> 
>>> rdd = sc.parallelize([9, -1, 0, 99, 0, -10])
>>> stats = rdd.aggregate(StatCounter(), StatCounter.merge, StatCounter.mergeStats)
>>> stats.minValue, stats.maxValue
(-10.0, 99.0)
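
For reference, RDD.stats() is a built-in shortcut that computes the same StatCounter in a single pass (a minimal sketch using the example RDD above):

s = rdd.stats()              # single pass; returns a StatCounter
print(s.min(), s.max())      # min/max, plus count(), mean(), stdev() from the same pass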



Answer 2:


Here is a working yet inelegant solution using accumulators. The inelegance lies in having to define the zero/initial values beforehand so that they do not interfere with the data:

from pyspark.accumulators import AccumulatorParam
class MinMaxAccumulatorParam(AccumulatorParam):
    def zero(self, value):
        # start every task copy from the initial (min, max) pair
        return value
    def addInPlace(self, val1, val2):
        # element 0 tracks the minimum, element 1 the maximum
        return (min(val1[0], val2[0]), max(val1[1], val2[1]))

# initial (min, max) sentinels; they must lie outside the data's range
minmaxAccu = sc.accumulator([500, -500], MinMaxAccumulatorParam())

def g(x):
    global minmaxAccu
    minmaxAccu += (x, x)   # fold each element into both the min and max slots

rdd = sc.parallelize([1, 2, 3, 4, 5])

rdd.foreach(g)

In [149]: minmaxAccu.value
Out[149]: (1, 5)
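
One way to avoid choosing magic sentinel values beforehand (a sketch, not part of the original answer; the class and variable names are hypothetical) is to seed the accumulator with infinities, which are neutral for min and max:

from pyspark.accumulators import AccumulatorParam

class InfMinMaxParam(AccumulatorParam):
    def zero(self, value):
        # identity element: nothing seen yet
        return (float("inf"), float("-inf"))
    def addInPlace(self, v1, v2):
        return (min(v1[0], v2[0]), max(v1[1], v2[1]))

minmaxAccu2 = sc.accumulator((float("inf"), float("-inf")), InfMinMaxParam())

rdd.foreach(lambda x: minmaxAccu2.add((x, x)))

# minmaxAccu2.value -> (1, 5) for the example RDD above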


Source: https://stackoverflow.com/questions/36559809/finding-min-max-with-pyspark-in-single-pass-over-data
