Group by day with Multiple Date Fields

Submitted by 一世执手 on 2019-12-11 14:32:17

Question


I have documents stored in MongoDB like this:

{
    "_id" : "XBpNKbdGSgGfnC2MJ",
    "po" : 72134185,
    "machine" : 40940,
    "location" : "02A01",
    "inDate" : ISODate("2017-07-19T06:10:13.059Z"),
    "requestDate" : ISODate("2017-07-19T06:17:04.901Z"),
    "outDate" : ISODate("2017-07-19T06:30:34Z")
}

I want to count the documents per day, for both inDate and outDate.

I can get the count of documents grouped by inDate day on one side and, on the other side, the count grouped by outDate, but I would like both counts together.

Currently, I use this pipeline:

      $group: {
        _id: {
          year: { $year: '$inDate' },
          month: { $month: '$inDate' },
          day: { $dayOfMonth: '$inDate' },
        },
        count: { $sum: 1 },
      },

which returns:

{ "_id" : { "year" : 2017, "month" : 7, "day" : 24 }, "count" : 1 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 21 }, "count" : 11 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 19 }, "count" : 20 }

But I would like, if possible:

{ "_id" : { "year" : 2017, "month" : 7, "day" : 24 }, "countIn" : 1, "countOut" : 4 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 21 }, "countIn" : 11, "countOut" : 23 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 19 }, "countIn" : 20, "countOut" : 18 }

Any idea? Many thanks :-)


Answer 1:


You can also split the documents at the source, essentially turning each document into an array of entries by "type", one for "in" and one for "out". You can do this simply using $map and $cond to select the fields, then $unwind the array and determine which field to count by inspecting the type with $cond:

collection.aggregate([
  { "$project": {
    "dates": {
      "$filter": {
        "input": { 
          "$map": {
            "input": [ "in", "out" ],
            "as": "type",
            "in": {
              "type": "$$type",
              "date": {
                "$cond": {
                  "if": { "$eq": [ "$$type", "in" ] },
                  "then": "$inDate",
                  "else": "$outDate"
                }
              }
            }
          }
        },
        "as": "dates",
        "cond": { "$ne": [ "$$dates.date", null ] }
      }
    }
  }},
  { "$unwind": "$dates" },
  { "$group": {
    "_id": {
      "year": { "$year": "$dates.date" },
      "month": { "$month": "$dates.date" },
      "day": { "$dayOfMonth": "$dates.date" }
    },
    "countIn": {
      "$sum": {
        "$cond": {
          "if": { "$eq": [ "$dates.type", "in" ]  },
          "then": 1,
          "else": 0
        }
      }
    },
    "countOut": {
      "$sum": {
        "$cond": {
          "if": { "$eq": [ "$dates.type", "out" ]  },
          "then": 1,
          "else": 0
        }
      }
    }
  }}
])
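The conditional counting that the $unwind and $group stages perform can be sketched in plain JavaScript (this is only an illustration of the logic, not MongoDB itself; the sample documents are made up):

```javascript
// Each document contributes one "in" entry and, if present, one "out"
// entry; entries are then counted per day and per type.
const docs = [
  { inDate: new Date("2017-07-19T06:10:13Z"), outDate: new Date("2017-07-19T06:30:34Z") },
  { inDate: new Date("2017-07-19T08:00:00Z"), outDate: null },
];

const counts = {};
for (const doc of docs) {
  for (const type of ["in", "out"]) {
    const date = type === "in" ? doc.inDate : doc.outDate; // the $cond field pick
    if (date == null) continue;                            // the $filter on null
    const key = `${date.getUTCFullYear()}-${date.getUTCMonth() + 1}-${date.getUTCDate()}`;
    counts[key] = counts[key] || { countIn: 0, countOut: 0 };
    counts[key][type === "in" ? "countIn" : "countOut"] += 1; // conditional $sum
  }
}

console.log(counts);
// { '2017-7-19': { countIn: 2, countOut: 1 } }
```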

That's a safe way to do this: it does not risk breaking the BSON document size limit, no matter how much data you send at it.

Personally I would rather run these as separate processes and "combine" the aggregated results afterwards, but that depends on the environment you are running in, which is not mentioned in the question.


For an example of "parallel" execution, you can structure it in Meteor somewhere along these lines:

import { Meteor } from 'meteor/meteor';
import { Source } from '../imports/source';
import { Target } from '../imports/target';

Meteor.startup(async () => {
  // code to run on server at startup

  await Source.remove({});
  await Target.remove({});

  console.log('Removed');

  Source.insert({
    "_id" : "XBpNKbdGSgGfnC2MJ",
    "po" : 72134185,
    "machine" : 40940,
    "location" : "02A01",
    "inDate" : new Date("2017-07-19T06:10:13.059Z"),
    "requestDate" : new Date("2017-07-19T06:17:04.901Z"),
    "outDate" : new Date("2017-07-19T06:30:34Z")
  });

  console.log('Inserted');

  await Promise.all(
    ["In","Out"].map( f => new Promise((resolve,reject) => {
      let cursor = Source.rawCollection().aggregate([
        { "$match": { [`${f.toLowerCase()}Date`]: { "$exists": true } } },
        { "$group": {
          "_id": {
            "year": { "$year": `$${f.toLowerCase()}Date` },
            "month": { "$month": `$${f.toLowerCase()}Date` },
            "day": { "$dayOfYear": `$${f.toLowerCase()}Date` } // note: $dayOfYear, hence "day" values like 200 below
          },
          [`count${f}`]: { "$sum": 1 }
        }}
      ]);

      cursor.on('data', async (data) => {
        cursor.pause();
        data.date = data._id;
        delete data._id;
        await Target.upsert(
          { date: data.date },
          { "$set": data }
        );
        cursor.resume();
      });

      cursor.on('end', () => resolve('done'));
      cursor.on('error', (err) => reject(err));
    }))
  );

  console.log('Mapped');

  let targets = await Target.find().fetch();
  console.log(targets);

});

Which is essentially going to output documents to the target collection, as was mentioned in the comments, like:

{
        "_id" : "XdPGMkY24AcvTnKq7",
        "date" : {
                "year" : 2017,
                "month" : 7,
                "day" : 200
        },
        "countIn" : 1,
        "countOut" : 1
}



Answer 2:


Riiiight. I came up with the following query. Admittedly, I have seen simpler and nicer ones in my life but it certainly gets the job done:

db.getCollection('test').aggregate([
  {
    $facet: // split aggregation into two pipelines
    {
      "in": [
        { "$match": { "inDate": { "$ne": null } } }, // get rid of null values
        { "$group": { "_id": { "y": { "$year": "$inDate" }, "m": { "$month": "$inDate" }, "d": { "$dayOfMonth": "$inDate" } }, "cIn": { "$sum": 1 } } }, // count per inDate day
      ],
      "out": [
        { "$match": { "outDate": { "$ne": null } } }, // get rid of null values
        { "$group": { "_id": { "y": { "$year": "$outDate" }, "m": { "$month": "$outDate" }, "d": { "$dayOfMonth": "$outDate" } }, "cOut": { "$sum": 1 } } }, // count per outDate day
      ]
    }
  },
  { "$project": { "result": { "$setUnion": [ "$in", "$out" ] } } }, // merge results into a new array
  { "$unwind": "$result" }, // unwind array into individual documents
  { "$replaceRoot": { "newRoot": "$result" } }, // remove the additional field level
  { "$group": { "_id": { "year": "$_id.y", "month": "$_id.m", "day": "$_id.d" }, "countIn": { "$sum": "$cIn" }, "countOut": { "$sum": "$cOut" } } } // group into the final result
])
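The merge tail of this query ($setUnion, $unwind, and the final $group) can be sketched in plain JavaScript: the two facet arrays are concatenated, then re-grouped by day so cIn and cOut land on a single document. The sample facet output values here are made up for illustration:

```javascript
// Hypothetical output of the two facet pipelines for one day:
const inResults  = [ { _id: { y: 2017, m: 7, d: 19 }, cIn: 20 } ];
const outResults = [ { _id: { y: 2017, m: 7, d: 19 }, cOut: 18 } ];

const merged = {};
for (const doc of [...inResults, ...outResults]) {          // $setUnion + $unwind
  const key = `${doc._id.y}-${doc._id.m}-${doc._id.d}`;
  merged[key] = merged[key] || { countIn: 0, countOut: 0 }; // $group by day
  merged[key].countIn  += doc.cIn  || 0;                    // $sum over cIn
  merged[key].countOut += doc.cOut || 0;                    // $sum over cOut
}

console.log(merged);
// { '2017-7-19': { countIn: 20, countOut: 18 } }
```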

As always with MongoDB aggregations, you can get an idea of what's going on by removing stages step by step, starting from the end of the pipeline.

EDIT:

As you can see in the comments below there was a bit of a discussion around document size limits and the general applicability of this solution.

So let's look at those aspects in greater detail and let's also compare the performance of the $facet based solution to the one based on $map (suggested by @NeilLunn to avoid potential document size issues).

I created 2 million test records that have random dates assigned to both the "inDate" and the "outDate" field:

{
    "_id" : ObjectId("597857e0fa37b3f66959571a"),
    "inDate" : ISODate("2016-07-29T22:00:00.000Z"),
    "outDate" : ISODate("1988-07-14T22:00:00.000Z")
}

The date range covered was from 01.01.1970 all the way to 01.01.2050, a total of 29220 distinct days. Given the random distribution of the 2 million test records across this range, both queries can be expected to return the full 29220 possible results (which both did).

Then I ran both queries five times each after freshly restarting my single MongoDB instance; the results in milliseconds looked like this:

$facet: 5663, 5400, 5380, 5460, 5520

$map: 9648, 9134, 9058, 9085, 9132

I also measured the size of the single document returned by the $facet stage: 3.19MB, reasonably far from the MongoDB document size limit (16MB at the time of writing). That limit only applies to the result document anyway and would not be a problem during pipeline processing.

Bottom line: If you want performance, use the solution suggested here. Be careful about the document size limit, though, in particular if your use case is not the exact one described in the question above (e.g. when you need to collect even more/bigger data). Also, I am not sure if in a sharded scenario both solutions still expose the same performance characteristics...



Source: https://stackoverflow.com/questions/45307580/group-by-day-with-multiple-date-fields
