How to remove duplicates based on a key in Mongodb?

坚强是说给别人听的谎言 提交于 2019-11-26 11:57:31

问题


I have a collection in MongoDB where there are around (~3 million records). My sample record would look like,

 { \"_id\" = ObjectId(\"50731xxxxxxxxxxxxxxxxxxxx\"),
   \"source_references\" : [
                           \"_id\" : ObjectId(\"5045xxxxxxxxxxxxxx\"),
                           \"name\" : \"xxx\",
                           \"key\" : 123
                          ]
 }

I am having a lot of duplicate records in the collection having same source_references.key. (By Duplicate I mean, source_references.key not the _id).

I want to remove duplicate records based on source_references.key, I\'m thinking of writing some PHP code to traverse each record and remove the record if exists.

Is there a way to remove the duplicates in Mongo Internal command line?


回答1:


If you are certain that the source_references.key identifies duplicate records, you can ensure a unique index with the dropDups:true index creation option in MongoDB 2.6 or older:

db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})

This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.

Important Notes:

  • The dropDups option was removed in MongoDB 3.0, so a different approach will be required. For example, you could use aggregation as suggested on: MongoDB duplicate documents even after adding unique key.
  • Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse:true index creation option so the index only applies to documents with a source_references.key field.

Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.




回答2:


This is the easiest query I used on my MongoDB 3.2

db.myCollection.find({}, {myCustomKey:1}).sort({_id:1}).forEach(function(doc){
    db.myCollection.remove({_id:{$gt:doc._id}, myCustomKey:doc.myCustomKey});
})

Index your customKey before running this to increase speed




回答3:


While @Stennie's is a valid answer, it is not the only way. Infact the MongoDB manual asks you to be very cautious while doing that. There are two other options

  1. Let the MongoDB do that for you using Map Reduce
    • Another way
  2. You do programatically which is less efficient.



回答4:


Here is a slightly more 'manual' way of doing it:

Essentially, first, get a list of all the unique keys you are interested.

Then perform a search using each of those keys and delete if that search returns bigger than one.

    db.collection.distinct("key").forEach((num)=>{
      var i = 0;
      db.collection.find({key: num}).forEach((doc)=>{
        if (i)   db.collection.remove({key: num}, { justOne: true })
        i++
      })
    });



回答5:


pip install mongo_remove_duplicate_indexes

  1. create a script in any language
  2. iterate over your collection
  3. create new collection and create new index in this collection with unique set to true ,remember this index has to be same as index u wish to remove duplicates from in ur original collection with same name for ex-u have a collection gaming,and in this collection u have field genre which contains duplicates,which u wish to remove,so just create new collection db.createCollection("cname") create new index db.cname.createIndex({'genre':1},unique:1) now when u will insert document with similar genre only first will be accepted,other will be rejected with duplicae key error
  4. now just insert the json format values u received into new collection and handle exception using exception handling for ex pymongo.errors.DuplicateKeyError

check out the package source code for the mongo_remove_duplicate_indexes for better understanding




回答6:


If you have enough memory, you can in scala do something like that:

cole.find().groupBy(_.customField).filter(_._2.size>1).map(_._2.tail).flatten.map(_.id)
.foreach(x=>cole.remove({id $eq x})


来源:https://stackoverflow.com/questions/13190370/how-to-remove-duplicates-based-on-a-key-in-mongodb

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!