Question
I have a set of objects in MongoDB that each have a set of values embedded in them, e.g.:
[1.22, 12.87, 1.24, 1.24, 9.87, 1.24, 87.65] // ... up to about 150 values
Is a map/reduce the best solution for finding the median (average) and mode (most common value) in the embedded arrays? The reason that I ask is that the map and the reduce both have to return the same (structurally) set of values. It looks like in my case I want to take in a set of values (the array) and return a set of two values (median, mode).
If not, what's the best way to approach this? I want it to run in a rake task, if that's relevant. It'd be an overnight data crunching kind of thing.
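Answer 1:
I assume you want the mode and median of each document; you can do this with map/reduce. In that case you compute the statistics in the map function, and reduce returns the map result untouched. Note that the map function below actually computes the arithmetic mean of marks and stores it under the key median; a true median would require sorting the array and taking the middle value.
map = function() {
    // Sum the embedded values
    var res = 0;
    for (var i = 0; i < this.marks.length; i++) {
        res = res + this.marks[i];
    }
    // Arithmetic mean of the array, stored under the key "median"
    var median = res / this.marks.length;
    emit(this._id, {marks: this.marks, median: median});
}
reduce = function (k, values) {
    // Each _id is emitted only once, so the value is passed through unchanged
    var result;
    values.forEach(function(value) {
        result = value;
    });
    return result;
}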
and for this collection
{ "_id" : ObjectId("4f02be1f1ae045175f0eb9f1"), "name" : "ram", "marks" : [ 1.22, 12.87, 1.24, 1.24, 9.87, 1.24, 87.65 ] }
{ "_id" : ObjectId("4f02be371ae045175f0eb9f2"), "name" : "sam", "marks" : [ 1.32, 11.87, 12.4, 4.24, 9.37, 3.24, 7.65 ] }
{ "_id" : ObjectId("4f02be4c1ae045175f0eb9f3"), "name" : "pam", "marks" : [ 3.32, 10.17, 11.4, 2.24, 2.37, 3.24, 30.65 ] }
you can get the result (the mean, stored under the median key) by running
db.test.mapReduce(map,reduce,{out: { inline : 1}})
{
"results" : [
{
"_id" : ObjectId("4f02be1f1ae045175f0eb9f1"),
"value" : {
"marks" : [
1.22,
12.87,
1.24,
1.24,
9.87,
1.24,
87.65
],
"median" : 16.475714285714286
}
},
{
"_id" : ObjectId("4f02be371ae045175f0eb9f2"),
"value" : {
"marks" : [
1.32,
11.87,
12.4,
4.24,
9.37,
3.24,
7.65
],
"median" : 7.155714285714285
}
},
{
"_id" : ObjectId("4f02be4c1ae045175f0eb9f3"),
"value" : {
"marks" : [
3.32,
10.17,
11.4,
2.24,
2.37,
3.24,
30.65
],
"median" : 9.055714285714286
}
}
],
"timeMillis" : 1,
"counts" : {
"input" : 3,
"emit" : 3,
"reduce" : 0,
"output" : 3
},
"ok" : 1,
}
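The map function above averages the values, so it does not yet produce a median or a mode. A minimal sketch of both statistics in plain JavaScript follows; the helper names median and mode are illustrative and not part of the original answer, and the tie-breaking rule for the mode is an assumption.

```javascript
// Median: middle value of the sorted array
// (mean of the two middle values when the length is even).
function median(values) {
  if (values.length === 0) return null;
  // Copy before sorting so the caller's array is left untouched.
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Mode: most frequent value. On a tie, the value that reaches
// the highest count first wins (an arbitrary choice for this sketch).
function mode(values) {
  var counts = {};
  var best = null;
  var bestCount = 0;
  for (var i = 0; i < values.length; i++) {
    var key = String(values[i]);
    counts[key] = (counts[key] || 0) + 1;
    if (counts[key] > bestCount) {
      bestCount = counts[key];
      best = values[i];
    }
  }
  return best;
}
```

To use these from map, they would presumably have to be defined inside the map function or passed in through the scope option of mapReduce, with the emit becoming something like emit(this._id, {median: median(this.marks), mode: mode(this.marks)}).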
Answer 2:
There's a key question here regarding the expected output. It's not 100% clear from your question which one you want.
Do you want (A):
{ _id: "document1", value: { mode: 1.0, median: 10.0 } }
{ _id: "document2", value: { mode: 5.0, median: 150.0 } }
... one for each document
... or do you want (B), the mode and median across the combination of all arrays?
- If the answer is (A), then Map/Reduce will work.
- If the answer is (B), then Map/Reduce will probably not work.
If you plan to do (A), please read the M/R documentation carefully and understand the limitations. While option (A) can be a Map/Reduce, it can also just be a big for loop with an upsert on the "summary" collection or even back into the original collection. This may be even more efficient.
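The "big for loop with an upsert" could look roughly like the sketch below. The functions stats and summarize are hypothetical names, and an in-memory Map stands in for the "summary" collection so the snippet runs without a database.

```javascript
// Median and mode of one array; assumes a non-empty array of numbers.
function stats(values) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  var median = sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;

  // Equal values are adjacent after sorting, so count run lengths.
  // On a tie the smallest value wins (arbitrary choice for this sketch).
  var mode = sorted[0], bestRun = 0, run = 0;
  for (var i = 0; i < sorted.length; i++) {
    run = (i > 0 && sorted[i] === sorted[i - 1]) ? run + 1 : 1;
    if (run > bestRun) { bestRun = run; mode = sorted[i]; }
  }
  return { median: median, mode: mode };
}

// Hypothetical driver: docs is anything with forEach (an array here),
// upsert is a callback that writes one summary record.
function summarize(docs, upsert) {
  docs.forEach(function (doc) {
    if (!doc.marks || doc.marks.length === 0) return; // nothing to summarize
    upsert(doc._id, stats(doc.marks));
  });
}

// In-memory stand-in for the "summary" collection.
var summary = new Map();
summarize(
  [{ _id: "document1", marks: [1.22, 12.87, 1.24, 1.24, 9.87, 1.24, 87.65] }],
  function (id, value) { summary.set(id, value); }
);
```

In the mongo shell, docs would presumably be a cursor such as db.test.find() and the callback something like db.summary.update({_id: id}, {$set: value}, {upsert: true}); that wiring is an untested assumption, not something taken from the answer.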
Answer 3:
I would start by reading this: http://www.mongovue.com/2010/11/03/yet-another-mongodb-map-reduce-tutorial/.
I think you want
- A map stage to generate your key and single data element,
- A reduce stage to place all the data elements into a data array for each key,
- A finalize stage to perform your mean, median and mode operations on the entire collection.
Finalize Function
A finalize function may be run after reduction. Such a function is optional and is not necessary for many map/reduce cases. The finalize function takes a key and a value, and returns a finalized value.
function finalize(key, value) -> final_value
Your reduce function may be called multiple times for the same object. Use finalize when something should only be done a single time at the end; for example calculating an average.
Taken from http://www.mongodb.org/display/DOCS/MapReduce
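The three stages described above can be sketched in plain JavaScript as follows. This is a simulation under stated assumptions, not MongoDB's own API: mapDoc takes the document and an emit callback as arguments (in MongoDB the document is this and emit is global), runMapReduce is a hypothetical in-memory driver, and the single key "all" is made up for illustration. Note that emit and reduce use the same value shape, {values: [...]}, so reduce can safely be re-applied to its own output.

```javascript
var GLOBAL_KEY = "all"; // hypothetical single key covering the whole collection

// Map: emit every document's array under the same key.
function mapDoc(doc, emit) {
  emit(GLOBAL_KEY, { values: doc.marks });
}

// Reduce: concatenate partial arrays; output has the same shape as the input.
function reduce(key, partials) {
  var merged = [];
  partials.forEach(function (p) { merged = merged.concat(p.values); });
  return { values: merged };
}

// Finalize: runs once per key, so the statistics are computed here.
function finalize(key, reduced) {
  var sorted = reduced.values.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  var median = sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
  // Run-length count over the sorted values; smallest value wins a tie.
  var mode = sorted[0], bestRun = 0, run = 0;
  for (var i = 0; i < sorted.length; i++) {
    run = (i > 0 && sorted[i] === sorted[i - 1]) ? run + 1 : 1;
    if (run > bestRun) { bestRun = run; mode = sorted[i]; }
  }
  return { median: median, mode: mode };
}

// Tiny in-memory driver standing in for the database.
function runMapReduce(docs) {
  var groups = {};
  docs.forEach(function (doc) {
    mapDoc(doc, function (k, v) { (groups[k] = groups[k] || []).push(v); });
  });
  var out = {};
  Object.keys(groups).forEach(function (k) {
    out[k] = finalize(k, reduce(k, groups[k]));
  });
  return out;
}

var result = runMapReduce([
  { marks: [1.22, 12.87, 1.24, 1.24, 9.87, 1.24, 87.65] },
  { marks: [1.32, 11.87, 12.4, 4.24, 9.37, 3.24, 7.65] },
  { marks: [3.32, 10.17, 11.4, 2.24, 2.37, 3.24, 30.65] }
]);
```

One caveat with this design: the reduced value holds every number in the collection, so in a real deployment it would have to fit within MongoDB's document size limit, which may rule it out for large collections.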
Source: https://stackoverflow.com/questions/8706220/is-map-reduce-appropriate-for-finding-the-median-and-mode-of-a-set-of-values-for