Is map/reduce appropriate for finding the median and mode of a set of values for many records?

Posted by 五迷三道 on 2019-12-11 04:35:37

Question


I have a set of objects in MongoDB that each have a set of values embedded in them, e.g.:

[1.22, 12.87, 1.24, 1.24, 9.87, 1.24, 87.65] // ... up to about 150 values

Is map/reduce the best solution for finding the median (the middle value) and mode (the most common value) of the embedded arrays? I ask because the map and reduce functions both have to return values with the same structure. In my case I want to take in a set of values (the array) and return a pair of values (median, mode).

If not, what's the best way to approach this? I want it to run in a rake task, if that's relevant. It'd be an overnight data crunching kind of thing.


Answer 1:


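I assume you want to find the mode and median of each document; you can do this with map/reduce. In this case you calculate the values in the map function, and reduce returns the map result untouched. Note that the map function here computes the arithmetic mean of the array and stores it under the name median; it does not compute the mode.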

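map = function() {
   // Sum every value in the embedded array.
   var res = 0;
   for (var i = 0; i < this.marks.length; i++) {
      res = res + this.marks[i];
   }
   // Note: sum / count is the arithmetic mean; it is kept under the
   // field name "median" to match the output shown below.
   var median = res/this.marks.length;
   emit(this._id,{marks:this.marks,median:median});
};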


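// Each _id is emitted exactly once, so reduce is never actually invoked
// (see "reduce" : 0 in the output). If it were, it would pass the last
// value through unchanged, as the original loop did.
reduce = function (k, values) {
    return values[values.length - 1];
};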

and for this collection

{ "_id" : ObjectId("4f02be1f1ae045175f0eb9f1"), "name" : "ram", "marks" : [ 1.22, 12.87, 1.24, 1.24, 9.87, 1.24, 87.65 ] }
{ "_id" : ObjectId("4f02be371ae045175f0eb9f2"), "name" : "sam", "marks" : [ 1.32, 11.87, 12.4, 4.24, 9.37, 3.24, 7.65 ] }
{ "_id" : ObjectId("4f02be4c1ae045175f0eb9f3"), "name" : "pam", "marks" : [ 3.32, 10.17, 11.4, 2.24, 2.37, 3.24, 30.65 ] }

you can get the result for each document by running

  db.test.mapReduce(map,reduce,{out: { inline : 1}})

{
    "results" : [
        {
            "_id" : ObjectId("4f02be1f1ae045175f0eb9f1"),
            "value" : {
                "marks" : [
                    1.22,
                    12.87,
                    1.24,
                    1.24,
                    9.87,
                    1.24,
                    87.65
                ],
                "median" : 16.475714285714286
            }
        },
        {
            "_id" : ObjectId("4f02be371ae045175f0eb9f2"),
            "value" : {
                "marks" : [
                    1.32,
                    11.87,
                    12.4,
                    4.24,
                    9.37,
                    3.24,
                    7.65
                ],
                "median" : 7.155714285714285
            }
        },
        {
            "_id" : ObjectId("4f02be4c1ae045175f0eb9f3"),
            "value" : {
                "marks" : [
                    3.32,
                    10.17,
                    11.4,
                    2.24,
                    2.37,
                    3.24,
                    30.65
                ],
                "median" : 9.055714285714286
            }
        }
    ],
    "timeMillis" : 1,
    "counts" : {
        "input" : 3,
        "emit" : 3,
        "reduce" : 0,
        "output" : 3
    },
    "ok" : 1,
}
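
The map function in this answer stores the arithmetic mean under the name median and leaves the mode out. A minimal sketch of helpers that compute a true median and mode for one array follows. The names medianOf and modeOf are hypothetical, and resolving a tie for the mode in favour of the value seen first is an assumption, since the question does not say how ties should be handled.

```javascript
// Hypothetical helpers; not part of the original answer.

// Median: middle value of the sorted array, or the mean of the two
// middle values when the length is even. Returns null for an empty array.
function medianOf(values) {
    if (values.length === 0) return null;
    // Copy before sorting so the caller's array is left untouched.
    var sorted = values.slice().sort(function (a, b) { return a - b; });
    var mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1
        ? sorted[mid]
        : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Mode: most frequent value. Assumption: on a tie, the value that
// appears first in the array wins.
function modeOf(values) {
    var counts = {};
    var best = null;
    var bestCount = 0;
    // First pass: count occurrences of each value.
    for (var i = 0; i < values.length; i++) {
        var key = String(values[i]);
        counts[key] = (counts[key] || 0) + 1;
    }
    // Second pass: pick the first value with the highest count.
    for (var j = 0; j < values.length; j++) {
        var c = counts[String(values[j])];
        if (c > bestCount) { bestCount = c; best = values[j]; }
    }
    return best;
}

var marks = [1.22, 12.87, 1.24, 1.24, 9.87, 1.24, 87.65];
var stats = { median: medianOf(marks), mode: modeOf(marks) };
```

Plain functions like these could be called on this.marks inside map. For the first sample document both the median and the mode come out as 1.24, compared with a mean of about 16.48.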



Answer 2:


There's a key question here regarding the expected output. It's not 100% clear from your question which one you want.

Do you want (A):

{ _id: "document1", value: { mode: 1.0, median: 10.0 } }
{ _id: "document2", value: { mode: 5.0, median: 150.0 } }
... one for each document

... or do you want (B), the mode and median across the combination of all arrays?

  • If the answer is (A), then Map/Reduce will work.
  • If the answer is (B), then Map/Reduce will probably not work.

If you plan to do (A), please read the M/R documentation carefully and understand the limitations. While option (A) can be a Map/Reduce, it can also just be a big for loop with an upsert on the "summary" collection or even back into the original collection. This may be even more efficient.
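
The "big for loop" alternative for option (A) could look roughly like this. It is a sketch under assumptions: the field name marks is borrowed from the sample collection elsewhere in this thread, summarize and buildSummaries are hypothetical names, the commented-out shell lines assume collections called test and summary, and ties for the mode go to the smallest value.

```javascript
// Hypothetical sketch of option (A) without map/reduce.

// Compute median and mode from one sorted copy of the array.
function summarize(values) {
    var sorted = values.slice().sort(function (a, b) { return a - b; });
    var n = sorted.length;
    if (n === 0) return { mode: null, median: null };

    var mid = Math.floor(n / 2);
    var median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

    // Equal values are adjacent after sorting, so count run lengths.
    // Assumption: on a tie the smallest value is reported.
    var mode = sorted[0], bestRun = 0, run = 0;
    for (var i = 0; i < n; i++) {
        run = (i > 0 && sorted[i] === sorted[i - 1]) ? run + 1 : 1;
        if (run > bestRun) { bestRun = run; mode = sorted[i]; }
    }
    return { mode: mode, median: median };
}

// Build one summary document per input document.
function buildSummaries(docs) {
    return docs.map(function (doc) {
        return { _id: doc._id, value: summarize(doc.marks) };
    });
}

// In a recent mongo shell the loop might write each result back instead,
// for example (collection names are assumptions):
//   db.test.find().forEach(function (doc) {
//       db.summary.updateOne({ _id: doc._id },
//                            { $set: { value: summarize(doc.marks) } },
//                            { upsert: true });
//   });

var summaries = buildSummaries([
    { _id: "document1", marks: [1.22, 12.87, 1.24, 1.24, 9.87, 1.24, 87.65] },
    { _id: "document2", marks: [1.32, 11.87, 12.4, 4.24, 9.37, 3.24, 7.65] }
]);
```

Because each document is handled independently, the same loop can run from an overnight task and be restarted safely: the upsert overwrites any summary that already exists for that _id.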




Answer 3:


I would start by reading this: http://www.mongovue.com/2010/11/03/yet-another-mongodb-map-reduce-tutorial/

I think you want

  1. A map stage to generate your key and single data element,
  2. A reduce stage to place all the data elements into a data array for each key,
  3. A finalize stage to perform your mean, median and mode operations on the entire collection.
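
The three stages listed above can be sketched as follows. This is an illustration under assumptions, not a tested production job: the field name marks and the shared key "all" are hypothetical, as are the function names, and the small emit stand-in at the end exists only so the sketch can be run outside MongoDB.

```javascript
// Hypothetical sketch of the three stages.

// 1. Map: emit one single-element array per value under a shared key.
function mapFn() {
    for (var i = 0; i < this.marks.length; i++) {
        emit("all", { values: [this.marks[i]] });
    }
}

// 2. Reduce: concatenate the arrays. Input and output have the same shape,
//    so the function stays correct if it is re-run on its own output.
function reduceFn(key, emitted) {
    var merged = [];
    emitted.forEach(function (e) { merged = merged.concat(e.values); });
    return { values: merged };
}

// 3. Finalize: runs once per key, so the statistics are computed here.
//    Assumption: on a tie for the mode, the smallest value is reported.
function finalizeFn(key, reduced) {
    var sorted = reduced.values.slice().sort(function (a, b) { return a - b; });
    var n = sorted.length;
    var sum = 0, counts = {}, mode = null, best = 0;
    sorted.forEach(function (v) {
        sum += v;
        counts[v] = (counts[v] || 0) + 1;
        if (counts[v] > best) { best = counts[v]; mode = v; }
    });
    var mid = Math.floor(n / 2);
    var median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    return { mean: sum / n, median: median, mode: mode };
}

// In-memory stand-in for the server, only for running the sketch locally.
// Inside MongoDB, emit is provided and this part is not needed.
var groups = {};
function emit(key, value) {
    (groups[key] = groups[key] || []).push(value);
}
[{ marks: [1, 2, 2] }, { marks: [2, 10] }].forEach(function (doc) {
    mapFn.call(doc);
});
var result = finalizeFn("all", reduceFn("all", groups["all"]));
```

In the shell, the three functions would be passed along the lines of db.test.mapReduce(mapFn, reduceFn, { out: { inline: 1 }, finalize: finalizeFn }). Note that the reduced value holds every data element in a single document, so a very large collection may run into the document size limit.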


Finalize Function

A finalize function may be run after reduction. Such a function is optional and is not necessary for many map/reduce cases. The finalize function takes a key and a value, and returns a finalized value.

function finalize(key, value) -> final_value

Your reduce function may be called multiple times for the same object. Use finalize when something should only be done a single time at the end; for example calculating an average.

Taken from http://www.mongodb.org/display/DOCS/MapReduce



Source: https://stackoverflow.com/questions/8706220/is-map-reduce-appropriate-for-finding-the-median-and-mode-of-a-set-of-values-for
