Fastest way to remove duplicate documents in mongodb

六月ゝ 毕业季﹏ 提交于 2019-11-26 19:43:58

Assuming you want to permanently delete docs that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:

db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true}) 

As the docs say, use extreme caution with this as it will delete data from your database. Back up your database first in case it doesn't do exactly as you're expecting.

UPDATE

This solution is only valid through MongoDB 2.x as the dropDups option is no longer available in 3.0 (docs).

dropDups: true option is not available in 3.0.

I have solution with aggregation framework for collecting duplicates and then removing in one go.

It might be somewhat slower than system level "index" changes. But it is good by considering way you want to remove duplicate documents.

a. Remove all documents in one go

var duplicates = [];

db.collectionName.aggregate([
  { $match: { 
    name: { "$ne": '' }  // discard selection criteria
  }},
  { $group: { 
    _id: { name: "$name"}, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }}, 
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // For faster processing if set is larger
)               // You can display result until this and check duplicates 
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    doc.dups.forEach( function(dupId){ 
        duplicates.push(dupId);   // Getting all duplicate ids
        }
    )    
})

// If you want to Check all "_id" which you are deleting else print statement not needed
printjson(duplicates);     

// Remove all duplicates in one go    
db.collectionName.remove({_id:{$in:duplicates}})  

b. You can delete documents one by one.

db.collectionName.aggregate([
  // discard selection criteria, You can remove "$match" section if you want
  { $match: { 
    source_references.key: { "$ne": '' }  
  }},
  { $group: { 
    _id: { source_references.key: "$source_references.key"}, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }}, 
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // For faster processing if set is larger
)               // You can display result until this and check duplicates 
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    db.collectionName.remove({_id : {$in: doc.dups }});  // Delete remaining duplicates
})

Create collection dump with mongodump

Clear collection

Add unique index

Restore collection with mongorestore

I found this solution that works with MongoDB 3.4: I'll assume the field with duplicates is called fieldX

db.collection.aggregate([
{
    // only match documents that have this field
    // you can omit this stage if you don't have missing fieldX
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "doc" : {"$first": "$$ROOT"}}
},
{
    $replaceRoot: { "newRoot": "$doc"}
}
],
{allowDiskUse:true})

Being new to mongoDB, I spent a lot of time and used other lengthy solutions to find and delete duplicates. However, I think this solution is neat and easy to understand.

It works by first matching documents that contain fieldX (I had some documents without this field, and I got one extra empty result).

The next stage groups documents by fieldX, and only inserts the $first document in each group using $$ROOT. Finally, it replaces the whole aggregated group by the document found using $first and $$ROOT.

I had to add allowDiskUse because my collection is large.

You can add this after any number of pipelines, and although the documentation for $first mentions a sort stage prior to using $first, it worked for me without it. " couldnt post a link here, my reputation is less than 10 :( "

You can save the results to a new collection by adding an $out stage...

Alternatively, if one is only interested in a few fields e.g. field1, field2, and not the whole document, in the group stage without replaceRoot:

db.collection.aggregate([
{
    // only match documents that have this field
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "field1": {"$first": "$$ROOT.field1"}, "field2": { "$first": "$field2" }}
}
],
{allowDiskUse:true})
  1. General idea is to use findOne https://docs.mongodb.com/manual/reference/method/db.collection.findOne/ to retrieve one random id from the duplicate records in the collection.

  2. Delete all the records in the collection other than the random-id that we retrieved from findOne option.

You can do something like this if you are trying to do it in pymongo.

def _run_query():

        try:

            for record in (aggregate_based_on_field(collection)):
                if not record:
                    continue
                _logger.info("Working on Record %s", record)

                try:
                    retain = db.collection.find_one(find_one({'fie1d1': 'x',  'field2':'y'}, {'_id': 1}))
                    _logger.info("_id to retain from duplicates %s", retain['_id'])

                    db.collection.remove({'fie1d1': 'x',  'field2':'y', '_id': {'$ne': retain['_id']}})

                except Exception as ex:
                    _logger.error(" Error when retaining the record :%s Exception: %s", x, str(ex))

        except Exception as e:
            _logger.error("Mongo error when deleting duplicates %s", str(e))


def aggregate_based_on_field(collection):
    return collection.aggregate([{'$group' : {'_id': "$fieldX"}}])

From the shell:

  1. Replace find_one to findOne
  2. Same remove command should work.

The following method merges documents with the same name while only keeping the unique nodes without duplicating them.

I found using the $out operator to be a simple way. I unwind the array and then group it by adding to set. The $out operator allows the aggregation result to persist [docs]. If you put the name of the collection itself it will replace the collection with the new data. If the name does not exist it will create a new collection.

Hope this helps.

allowDiskUse may have to be added to the pipeline.

db.collectionName.aggregate([
  {
    $unwind:{path:"$nodes"},
  },
  {
    $group:{
      _id:"$name",
      nodes:{
        $addToSet:"$nodes"
      }
  },
  {
    $project:{
      _id:0,
      name:"$_id.name",
      nodes:1
    }
  },
  {
    $out:"collectionNameWithoutDuplicates"
  }
])

Here is a slightly more 'manual' way of doing it:

Essentially, first, get a list of all the unique keys you are interested.

Then perform a search using each of those keys and delete if that search returns bigger than one.

  db.collection.distinct("key").forEach((num)=>{
    var i = 0;
    db.collection.find({key: num}).forEach((doc)=>{
      if (i)   db.collection.remove({key: num}, { justOne: true })
      i++
    })
  });
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!