Question
I have a collection in MongoDB with around 3 million records. A sample record looks like this:
{ \"_id\" = ObjectId(\"50731xxxxxxxxxxxxxxxxxxxx\"),
\"source_references\" : [
\"_id\" : ObjectId(\"5045xxxxxxxxxxxxxx\"),
\"name\" : \"xxx\",
\"key\" : 123
]
}
I have a lot of duplicate records in the collection with the same source_references.key. (By duplicate I mean records sharing the same source_references.key, not the same _id.)
I want to remove duplicate records based on source_references.key. I'm thinking of writing some PHP code to traverse each record and remove it if a duplicate exists.
Is there a way to remove the duplicates from the mongo shell directly?
Answer 1:
If you are certain that the source_references.key identifies duplicate records, you can ensure a unique index with the dropDups:true index creation option in MongoDB 2.6 or older:
db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})
This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.
Important Notes:
- The dropDups option was removed in MongoDB 3.0, so a different approach is required. For example, you could use aggregation, as suggested in MongoDB duplicate documents even after adding unique key (see the sketch below).
- Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse:true index creation option so the index only applies to documents with a source_references.key field.
Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.
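For MongoDB 3.0+, here is a minimal aggregation-based sketch, using the collection and field names from the question. It assumes each document has a single source_references entry; since $addToSet does not preserve order, which duplicate survives is arbitrary.

// Group documents by source_references.key, collect their _ids,
// and remove all but one _id per key.
db.things.aggregate([
  { $unwind: "$source_references" },
  { $group: { _id: "$source_references.key", ids: { $addToSet: "$_id" } } },
  { $match: { "ids.1": { $exists: true } } }  // only keys with more than one _id
]).forEach(function (group) {
  group.ids.slice(1).forEach(function (id) {
    db.things.remove({ _id: id });
  });
});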
Answer 2:
This is the easiest approach I used on MongoDB 3.2:
// For each document in ascending _id order, remove every later document
// that shares the same myCustomKey, keeping the first occurrence.
db.myCollection.find({}, {myCustomKey: 1}).sort({_id: 1}).forEach(function(doc) {
  db.myCollection.remove({_id: {$gt: doc._id}, myCustomKey: doc.myCustomKey});
});
Index myCustomKey before running this to speed it up.
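For example, in the mongo shell (createIndex is available from MongoDB 3.0; older shells use ensureIndex):

// An ascending index on myCustomKey speeds up the remove() filter above.
db.myCollection.createIndex({ myCustomKey: 1 })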
Answer 3:
While @Stennie's is a valid answer, it is not the only way. In fact, the MongoDB manual advises caution with that approach. There are two other options:
- Let MongoDB do it for you using map-reduce (see the sketch after this list).
- Do it programmatically, which is less efficient.
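A minimal map-reduce sketch of the first option. It assumes the duplicate key is a top-level field named key, and the inline output must fit within the 16 MB limit.

// Map each key to the list of _ids that share it, then keep the first
// _id per key and remove the rest.
var res = db.collection.mapReduce(
  function () { emit(this.key, { ids: [this._id] }); },
  function (key, values) {
    var all = [];
    values.forEach(function (v) { all = all.concat(v.ids); });
    return { ids: all };
  },
  { out: { inline: 1 } }
);
res.results.forEach(function (r) {
  r.value.ids.slice(1).forEach(function (id) {
    db.collection.remove({ _id: id });
  });
});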
Answer 4:
Here is a slightly more 'manual' way of doing it:
Essentially, first get a list of all the distinct key values you are interested in.
Then iterate over the documents for each of those values and delete every match after the first.
// For each distinct value of "key", keep the first document returned
// and remove every subsequent document with the same value by _id.
db.collection.distinct("key").forEach((value) => {
  var first = true;
  db.collection.find({key: value}).forEach((doc) => {
    if (!first) db.collection.remove({_id: doc._id});
    first = false;
  });
});
Answer 5:
pip install mongo_remove_duplicate_indexes
- Create a script in any language.
- Iterate over your collection.
- Create a new collection and create a unique index on it. This index must be on the same field you wish to deduplicate in your original collection. For example, if you have a collection gaming with a field genre that contains duplicates you wish to remove, create a new collection and a unique index on it:
  db.createCollection("cname")
  db.cname.createIndex({'genre': 1}, {unique: true})
  Now when you insert documents, only the first document with a given genre will be accepted; the rest will be rejected with a duplicate key error.
- Now insert the documents you read from the original collection into the new collection, handling the duplicate key exception (e.g. pymongo.errors.DuplicateKeyError in Python). A shell sketch of the same idea follows below.
Check out the source code of the mongo_remove_duplicate_indexes package for a better understanding.
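Here is a minimal sketch of this approach in the legacy mongo shell, using the example names from the list above (gaming collection, genre field; gaming_dedup is a hypothetical target collection):

db.createCollection("gaming_dedup");
db.gaming_dedup.createIndex({ genre: 1 }, { unique: true });
db.gaming.find().forEach(function (doc) {
  // In the legacy shell, an insert that violates the unique index just
  // reports a duplicate key write error (code 11000), so the loop continues.
  db.gaming_dedup.insert(doc);
});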
Answer 6:
If you have enough memory, you can do something like this in Scala:
// Pseudocode for a Scala driver: group all documents in memory by the
// duplicate key, keep the first of each group, and remove the rest by _id.
cole.find().toList.groupBy(_.customField).values.flatMap(_.tail)
  .foreach(d => cole.remove("_id" $eq d.id))
Source: https://stackoverflow.com/questions/13190370/how-to-remove-duplicates-based-on-a-key-in-mongodb