Find all duplicate documents in a MongoDB collection by a key field

温柔的废话 2020-11-28 22:00

Suppose I have a collection with some set of documents, something like this:

{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id" : 1, "name" : "foo" }


        
5 Answers
  •  粉色の甜心
    2020-11-28 22:39

    The accepted answer is terribly slow on large collections, and doesn't return the _ids of the duplicate records.

    Aggregation is much faster and can return the _ids:

    db.collection.aggregate([
      { $group: {
        _id: { name: "$name" },   // replace `name` (both occurrences) with your key field
        uniqueIds: { $addToSet: "$_id" },
        count: { $sum: 1 } 
      } }, 
      { $match: { 
        count: { $gte: 2 } 
      } },
      { $sort : { count : -1} },
      { $limit : 10 }
    ]);
    

    In the first stage of the aggregation pipeline, the $group operator aggregates documents by the name field and stores in uniqueIds each _id value of the grouped records. The $sum operator adds up the values of the fields passed to it, in this case the constant 1 - thereby counting the number of grouped records into the count field.

    In the second stage of the pipeline, we use $match to filter documents with a count of at least 2, i.e. duplicates.

    Then, we sort the most frequent duplicates first, and limit the results to the top 10.
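    To make the semantics of the four stages concrete, here is a plain-JavaScript sketch of what the pipeline computes, run over an in-memory array (the `docs` array and its values are made-up sample data, not from the original question):

    ```javascript
    // Sample documents; "name" plays the role of the key field.
    const docs = [
      { _id: 1, name: "Toothpick" },
      { _id: 2, name: "Broom" },
      { _id: 3, name: "Toothpick" },
      { _id: 4, name: "Mop" },
      { _id: 5, name: "Broom" },
    ];

    // Stage 1 ($group): collect the _ids and a count per distinct name.
    // _ids are unique, so a plain array here behaves like $addToSet.
    const groups = {};
    for (const doc of docs) {
      const g = groups[doc.name] ?? (groups[doc.name] = { uniqueIds: [], count: 0 });
      g.uniqueIds.push(doc._id);
      g.count += 1;
    }

    // Stage 2 ($match): keep only names that occur at least twice.
    // Stages 3-4 ($sort, $limit): most frequent first, capped at 10.
    const duplicates = Object.entries(groups)
      .filter(([, g]) => g.count >= 2)
      .sort(([, a], [, b]) => b.count - a.count)
      .slice(0, 10);

    console.log(duplicates);
    ```

    Here "Toothpick" and "Broom" come out as duplicates while "Mop", which occurs once, is filtered out by the $match stage.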

    This query will output up to $limit records with duplicate names, along with their _ids. For example:

    {
      "_id" : {
        "name" : "Toothpick"
      },
      "uniqueIds" : [
        "xzuzJd2qatfJCSvkN",
        "9bpewBsKbrGBQexv4",
        "fi3Gscg9M64BQdArv"
      ],
      "count" : 3
    },
    {
      "_id" : {
        "name" : "Broom"
      },
      "uniqueIds" : [
        "3vwny3YEj2qBsmmhA",
        "gJeWGcuX6Wk69oFYD"
      ],
      "count" : 2
    }
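
    If the goal is to remove the duplicates afterwards, the `uniqueIds` array from each result document can be split into one _id to keep and the rest to delete. A minimal sketch, assuming the result shape shown above (the _id values are the sample ones from the example output):

    ```javascript
    // One result document from the pipeline (shape copied from the example output).
    const result = {
      _id: { name: "Toothpick" },
      uniqueIds: ["xzuzJd2qatfJCSvkN", "9bpewBsKbrGBQexv4", "fi3Gscg9M64BQdArv"],
      count: 3,
    };

    // Keep the first _id and collect the rest for deletion; in the shell the
    // toDelete array would feed db.collection.deleteMany({ _id: { $in: toDelete } }).
    const [keep, ...toDelete] = result.uniqueIds;
    console.log(keep, toDelete);
    ```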
    
