MongoDB aggregate queries vs. MySQL SELECT field1 FROM table

Question

I am completely new to MongoDB and wanted to compare the query performance of a NoSQL data model with that of its relational database counterpart. I ran this in the MongoDB shell:

// Make 10 businesses
// Each business has 10 locations
// Each location has 10 departments
// Each department has 10 teams
// Each team has 100 employees
(new Array(10)).fill(0).forEach(_=>
    db.businesses.insert({
        "name":"Business Name",
        "locations":(new Array(10)).fill(0).map(_=>({
            "name":"Office Location",
            "departments":(new Array(10)).fill(0).map(_=>({
                "name":"Department",
                "teams":(new Array(10)).fill(0).map(_=>({
                    "name":"Team Name",
                    "employees":(new Array(100)).fill(0).map(_=>({
                        "age":Math.floor(Math.random()*100)
                    }))
                }))
            }))
        }))
    })
);

Then I attempted the equivalent of MySQL's EXPLAIN SELECT age,name,(and a few other fields) FROM employees WHERE age >= 50 ORDER BY age DESC by writing this statement:

db.businesses.aggregate([
    { $unwind: "$locations" },
    { $unwind: "$locations.departments" },
    { $unwind: "$locations.departments.teams" },
    { $unwind: "$locations.departments.teams.employees" },
    { $project: { _id: 0, age: "$locations.departments.teams.employees.age" } },
    { $match: { "age": { $gte: 50 }} },
    { $sort: {"age" : -1}}
]).explain("executionStats")

The result was:

"errmsg" : "Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.",

So I deleted the sort clause and tried to get an explain. But the result was:

TypeError: db.businesses.aggregate(...).explain is not a function

So my questions are:

Primarily, I want to know how the performance of SELECT age FROM employees WHERE age >= 50 ORDER BY age DESC compares with that of its MongoDB aggregate counterpart. Is it more or less the same? Will one be substantially faster than the other?
Alternatively, how do I fix my MongoDB query so that I can get performance details to compare against my MySQL query?

Answer 1:

Employees are single entities; thus, you probably don't want to model the age of a team member so deep inside the rich structure of locations, departments, and teams. It is perfectly fine to have a separate employees collection and simply do:

db.employees.aggregate([
{$match: {"age": {$gte: 50} }}
,{$sort: {"age": -1} }
]);

Deep in your businesses collection, each team can then hold just references to its employees:

{ teams: [ {name: "T1", employees: [ "E1", "E34" ]} ] }
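
As a rough illustration of that remodeling, the sketch below flattens one nested business document into standalone employee documents. It is plain JavaScript that runs without a server. The function name flattenEmployees and the business/location/department/team fields copied onto each employee are assumptions made for this example, not something from the question.

```javascript
// Hypothetical helper: turn one nested business document into flat
// employee documents, one per employee, carrying the names of the
// enclosing levels so they can still be filtered or grouped on.
function flattenEmployees(business) {
    var out = [];
    business.locations.forEach(function (loc) {
        loc.departments.forEach(function (dept) {
            dept.teams.forEach(function (team) {
                team.employees.forEach(function (emp) {
                    out.push({
                        business: business.name,
                        location: loc.name,
                        department: dept.name,
                        team: team.name,
                        age: emp.age
                    });
                });
            });
        });
    });
    return out;
}

// Tiny sample in the same shape as the question's documents.
var sampleBusiness = {
    name: "Business Name",
    locations: [{
        name: "Office Location",
        departments: [{
            name: "Department",
            teams: [{
                name: "Team Name",
                employees: [{ age: 61 }, { age: 34 }, { age: 50 }]
            }]
        }]
    }]
};

var employees = flattenEmployees(sampleBusiness);
console.log(employees.length); // 3
```

In a shell session the resulting array could be loaded with something like db.employees.insertMany(employees), after which a query on age needs no $unwind at all.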

Alternatively, keep your pipeline and opt in to external sorting:

db.businesses.aggregate([ your pipeline] ,{allowDiskUse:true});
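
For the explain part of the question, a minimal sketch follows. It assumes the legacy mongo shell, where the cursor returned by aggregate() has no explain() method, which is what the TypeError reports. Calling explain() on the collection first (or passing explain: true as an aggregate option) is the usual way around that. The wrapper name explainEmployeesByAge is hypothetical and exists only to keep the snippet self-contained; older servers may report only the query plan rather than full execution statistics for aggregations.

```javascript
// The pipeline from the question, kept as plain data.
var pipeline = [
    { $unwind: "$locations" },
    { $unwind: "$locations.departments" },
    { $unwind: "$locations.departments.teams" },
    { $unwind: "$locations.departments.teams.employees" },
    { $project: { _id: 0, age: "$locations.departments.teams.employees.age" } },
    { $match: { age: { $gte: 50 } } },
    { $sort: { age: -1 } }
];

// Hypothetical wrapper: `database` is expected to be the shell's `db` object.
// explain() is called on the collection, not on the aggregate() result,
// and allowDiskUse lets the $sort spill to disk past the memory limit.
function explainEmployeesByAge(database) {
    return database.businesses
        .explain("executionStats")
        .aggregate(pipeline, { allowDiskUse: true });
}
```

In the shell this would be invoked as explainEmployeesByAge(db).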

The OP has a setup of 10 biz -> 10 loc -> 10 depts -> 10 teams -> 100 emps. The first three unwinds turn the 10 business documents into 10,000 documents, and the last one multiplies that by another 100, to 1,000,000. We can shrink the hit by using $filter:

db.businesses.aggregate([
    { $unwind: "$locations" },
    { $unwind: "$locations.departments" },
    { $unwind: "$locations.departments.teams" },
    { $project: {
        XX: { $filter: {
            input: "$locations.departments.teams.employees",
            as: "z",
            cond: { $gte: [ "$$z.age", 50 ] }
        }}
    }},
    { $unwind: "$XX" },
    { $sort: { "XX.age": -1 } }
])
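
To make the arithmetic concrete, the following plain-JavaScript sketch (no server needed) multiplies out how many documents flow out of each $unwind for the array sizes in the question. The helper name docsAfterEachUnwind is made up for this illustration, and the 50% figure for the $filter variant is an assumption based on the ages being uniformly random.

```javascript
// Array sizes from the question's setup script, in $unwind order.
var arraySizes = [
    { field: "locations", size: 10 },
    { field: "departments", size: 10 },
    { field: "teams", size: 10 },
    { field: "employees", size: 100 }
];

// Each $unwind emits one document per array element, so the document
// count is multiplied by the array size at every stage.
function docsAfterEachUnwind(startDocs, sizes) {
    var docs = startDocs;
    return sizes.map(function (s) {
        docs = docs * s.size;
        return { field: s.field, docs: docs };
    });
}

var fanOut = docsAfterEachUnwind(10, arraySizes);
fanOut.forEach(function (row) {
    console.log("after $unwind " + row.field + ": " + row.docs);
});

// With $filter applied before the last $unwind, only matching employees
// are unwound. Assumption: about half of them are aged 50 or over.
var teamDocs = fanOut[2].docs;
var expectedWithFilter = teamDocs * 100 * 0.5;
console.log("after $filter + $unwind XX: about " + expectedWithFilter);
```

Under that assumption the $sort sees roughly half as many documents, and each one carries only a single employee rather than the full projected age plus a later $match.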

Answer 2:

I was able to get a result in 1.5 seconds without any indexes by modifying the query as follows:

db.businesses.aggregate([
    {
        $unwind: "$locations"
    },
    {
        $unwind: "$locations.departments"
    },
    {
        $unwind: "$locations.departments.teams"
    },
    {
        $unwind: "$locations.departments.teams.employees"
    },
    {
        $match: {
            "locations.departments.teams.employees.age": {
                $gte: 50
            }
        }
    },
    {
        $project: {
            _id: 0,
            age: "$locations.departments.teams.employees.age"
        }
    },
    {
        $group: {
            _id: "$age"
        }
    },
    {
        $project: {
            _id: 0,
            age: "$_id"
        }
    },
    {
        $sort: {
            "age": - 1
        }
    }
], {
    explain: false
})
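
Note that the $group stage collapses the output to one document per distinct age, so this pipeline is closer to SELECT DISTINCT age than to the original SELECT, and the final $sort has at most 50 values to order. The plain-JavaScript sketch below (no server needed) mimics the $match, $group, and $sort steps on random ages; the sample size of 10,000 is an arbitrary choice for illustration.

```javascript
// Generate random ages the same way the question's setup script does.
var sampleAges = [];
for (var i = 0; i < 10000; i++) {
    sampleAges.push(Math.floor(Math.random() * 100));
}

// $match: keep ages of 50 or over.
var matched = sampleAges.filter(function (age) { return age >= 50; });

// $group on the age value: one entry per distinct age.
var distinctAges = Array.from(new Set(matched));

// $sort descending.
distinctAges.sort(function (a, b) { return b - a; });

console.log(matched.length + " matching ages, " +
            distinctAges.length + " distinct values after grouping");
```

Sorting a few dozen distinct values is far cheaper than sorting every matching employee, which is worth keeping in mind when comparing the timing against the SQL query.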

Answer 3:

You had better move the $match to the first stage of the pipeline, because the aggregation framework can no longer use an index after the first stage. Also, I guess you don't need to unwind those arrays.
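
A minimal sketch of that reordering, assuming the same businesses collection: the leading $match uses the dotted path through the nested arrays, so it can only discard whole business documents that contain no employee aged 50 or over, and a second $match after the unwinds is still needed to drop individual employees. With the random data from the question every business is likely to match, so the gain here would be small. The variable names agePath, ageFilter, and matchFirstPipeline are made up for this example.

```javascript
var agePath = "locations.departments.teams.employees.age";

// The same condition is used twice: once on whole documents, once per employee.
var ageFilter = {};
ageFilter[agePath] = { $gte: 50 };

var matchFirstPipeline = [
    { $match: ageFilter },                 // document-level filter, first stage
    { $unwind: "$locations" },
    { $unwind: "$locations.departments" },
    { $unwind: "$locations.departments.teams" },
    { $unwind: "$locations.departments.teams.employees" },
    { $match: ageFilter },                 // employee-level filter after unwinding
    { $project: { _id: 0, age: "$" + agePath } },
    { $sort: { age: -1 } }
];

// In the shell: db.businesses.aggregate(matchFirstPipeline, { allowDiskUse: true })
```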

Answer 4:

There is another way to address the overall problem, although it is not an apples-to-apples match for the OP's question. The goal is to find all employees with age >= 50 and sort them. Below is an example that "almost" does so; it also returns the location, department, and team names in case you were wondering how to get those, and you can remove those lines to get just the employees. The result is unsorted, but an argument can be made that the DB engine isn't going to do any better a job of sorting this than the client, and all the data has to come over the wire anyway. The client can also use more sophisticated coding tricks to dig through to the age field and sort it.

c = db.foo.aggregate([
{$project: {XX:
  {$map: {input: "$locations", as:"z", in:
          {$map: {input: "$$z.departments", as:"z2", in:
                  {$map: {input: "$$z2.teams", as:"z3", in:
                          {loc: "$$z.name",  // remove if you want
                           dept: "$$z2.name", // remove if you want
                           team: "$$z3.name",  // remove if you want
                           emps: {$filter: {input: "$$z3.employees",
                                     as: "z4",
                                     cond: {$gte: [ "$$z4.age", 50] }
                                    }}
                          }
                      }}
              }}
      }}
    }}
]);

ages = [];

c.forEach(function(biz) {
    biz['XX'].forEach(function(locs) {
        locs.forEach(function(depts) {
            depts.forEach(function(teams) {
                teams['emps'].forEach(function(emp) {
                    ages.push(emp['age']);
                });
            });
        });
    });
});

print( ages.sort(function(a, b){return b-a}) );

99,98,97,96,95,94,92,92,84,81,78,77,76,72,71,67,66,65,65,64,63,62,62,61,59,59,57,57,57,56,55,54,52,51
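
The client-side walk above can be tried without a server. The sketch below uses a small hand-written result in the shape the $map/$filter projection produces (an array per location, containing an array per department, containing one object per team). The sample values and the helper name collectAges are invented for illustration.

```javascript
// Hand-written stand-in for the cursor contents: one business document.
var sampleResult = [{
    XX: [                               // one entry per location
        [                               // one entry per department
            [                           // one entry per team
                { loc: "L1", dept: "D1", team: "T1", emps: [{ age: 72 }, { age: 55 }] },
                { loc: "L1", dept: "D1", team: "T2", emps: [] }
            ],
            [
                { loc: "L1", dept: "D2", team: "T1", emps: [{ age: 99 }] }
            ]
        ]
    ]
}];

// Walk business -> location -> department -> team -> employee,
// collect the ages, and sort them in descending order on the client.
function collectAges(results) {
    var collected = [];
    results.forEach(function (biz) {
        biz.XX.forEach(function (departments) {
            departments.forEach(function (teams) {
                teams.forEach(function (team) {
                    team.emps.forEach(function (emp) {
                        collected.push(emp.age);
                    });
                });
            });
        });
    });
    return collected.sort(function (a, b) { return b - a; });
}

console.log(collectAges(sampleResult).join(",")); // 99,72,55
```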

On a MacBook Pro running MongoDB 4.0, we see the collection as follows:

Collection            Count   AvgSize          Unz  Xz  +Idx     TotIdx  Idx/doc
--------------------  ------- -------- -G--M------  --- ---- ---M------  -------
                 foo       10   2238682     22386820  4.0    0      16384    0

Given that the ages are random between 0 and 99, it is not surprising that every loc/dept/team has someone with age >= 50 and that the total number of bytes returned is about half of the collection size. Note, however, that the total time to set up the agg (not to return all the bytes) is ~700 millis.

697 millis to agg; 0.697
found 10
tot bytes 11536558

Source: https://stackoverflow.com/questions/59090237/mongodb-aggregate-queries-vs-mysql-select-field1-from-table

Tags

mysql

mongodb

aggregate