Computing median in map reduce

礼貌的吻别 2020-12-14 20:33

Can someone explain the computation of median/quantiles in map reduce?

My understanding of Datafu's median is that the 'n' mappers sort the data and send the da

4 answers
  •  渐次进展
    2020-12-14 21:10

    Do you really need the exact median and quantiles?

    A lot of the time you are better off computing approximate values and working with those, in particular if you use them for something like data partitioning.

    In fact, you can use the approximate quantiles to speed up finding the exact quantiles (actually in O(n/p) time). Here is a rough outline of the strategy:

    1. Have a mapper for each partition compute the desired quantiles, and output them to a new data set. This data set should be several orders of magnitude smaller (unless you ask for too many quantiles!).
    2. Within this data set, compute the quantiles again, similar to "median of medians". These are your initial estimates.
    3. Repartition the data according to these quantiles (or even additional partitions obtained this way). The goal is that in the end, the true quantile is guaranteed to be in one partition, and there should be at most one of the desired quantiles in each partition.
    4. Within each of the partitions, perform a QuickSelect (in O(n)) to find the true quantile.
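
The four steps above can be sketched in a single process, with plain Python lists standing in for the mapper partitions. This is only an illustration under stated assumptions, not DataFu's or Hadoop's implementation: the names `quickselect` and `exact_quantile` are hypothetical, the rank convention `int(q * n)` is an arbitrary choice, and instead of re-estimating a single value in step 2 the sketch simply uses all local quantiles as split points (the "additional partitions" variant mentioned in step 3).

```python
import bisect
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected linear time."""
    values = list(values)
    while True:
        pivot = random.choice(values)
        lows = [v for v in values if v < pivot]
        n_equal = sum(1 for v in values if v == pivot)
        if k < len(lows):
            values = lows
        elif k < len(lows) + n_equal:
            return pivot
        else:
            k -= len(lows) + n_equal
            values = [v for v in values if v > pivot]


def exact_quantile(partitions, q):
    """Exact q-quantile of the union of all partitions (hypothetical helper)."""
    n = sum(len(p) for p in partitions)
    if n == 0:
        raise ValueError("no data")
    # Assumed convention: the quantile is the element at global rank int(q * n).
    target = min(n - 1, int(q * n))

    # Step 1: each "mapper" reports the quantile of its own partition.
    local = [quickselect(p, min(len(p) - 1, int(q * len(p))))
             for p in partitions if p]

    # Step 2 (simplified): the sorted local quantiles become the split points.
    splits = sorted(set(local))

    # Step 3: repartition. Bucket i receives the values v with
    # splits[i-1] < v <= splits[i], so the buckets are globally ordered.
    buckets = [[] for _ in range(len(splits) + 1)]
    for p in partitions:
        for v in p:
            buckets[bisect.bisect_left(splits, v)].append(v)

    # Step 4: walk the bucket sizes to find the bucket holding the target
    # rank, then run QuickSelect inside that bucket only.
    for bucket in buckets:
        if target < len(bucket):
            return quickselect(bucket, target)
        target -= len(bucket)


# Example: the median of the values 1..9 spread over three partitions.
print(exact_quantile([[5, 1, 9], [3, 7], [8, 2, 6, 4]], 0.5))  # prints 5
```

In a real job only the bucket sizes would need to reach a coordinator; the final QuickSelect would run on whichever reducer owns the bucket that contains the target rank.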

    Each step runs in linear time. The most costly step is step 3, because it redistributes the whole data set and so generates O(n) network traffic.

    You can probably optimize the process by choosing "alternate" quantiles for the first iteration. Say you want to find the global median. You can't find it easily in a single linear pass, but you can probably narrow it down to 1/k-th of the data set when the data is split into k partitions. So instead of having each node report only its median, have each node additionally report the objects at (k-1)/(2k) and (k+1)/(2k). This should let you significantly narrow down the range of values in which the true median must lie. In the next step, you can have each node send the objects within that range to a single master node, and choose the median within this range only.
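
The narrowing idea can be sketched the same way. The answer does not say how the reported values should be combined, so the code below makes its own assumption: it takes the smallest reported (k-1)/(2k) value and the largest reported (k+1)/(2k) value as the range, which is conservative but always contains the true median. How much the range shrinks depends on how the data is distributed across nodes. The function name `narrowed_median` and the upper-median convention (rank `n // 2`) are hypothetical choices for this sketch.

```python
import bisect


def narrowed_median(partitions):
    """Median of the union of all partitions via a narrowed candidate range."""
    # Each "node" sorts its own data; empty partitions are ignored.
    parts = [sorted(p) for p in partitions if p]
    k = len(parts)
    n = sum(len(p) for p in parts)
    if n == 0:
        raise ValueError("no data")
    target = n // 2  # assumed convention: upper median, 0-based rank

    # First pass: every node reports its local (k-1)/(2k) and (k+1)/(2k) objects.
    lows, highs = [], []
    for p in parts:
        m = len(p)
        lows.append(p[(k - 1) * m // (2 * k)])
        highs.append(p[min(m - 1, (k + 1) * m // (2 * k))])

    # Assumption of this sketch: use the extreme reports as the range, so that
    # fewer than n/2 values lie below lo and fewer than n/2 lie above hi.
    lo, hi = min(lows), max(highs)

    # Second pass: every node reports how many of its values are below lo
    # and sends only the values inside [lo, hi] to the master.
    below = sum(bisect.bisect_left(p, lo) for p in parts)
    candidates = sorted(v for p in parts for v in p if lo <= v <= hi)

    # The master picks the median by its rank relative to the range start.
    return candidates[target - below]
```

Sorting the candidates on the master keeps the sketch short; a QuickSelect over the candidates would keep that last step linear.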
