How can I delete duplicates in MongoDB?

走远了吗. Submitted on 2019-12-07 03:50:32

Question


I have a large collection (~2.7 million documents) in MongoDB, and it contains a lot of duplicates. I tried running ensureIndex({id:1}, {unique:true, dropDups:true}) on the collection. Mongo churns away at it for a while and then fails with the error "too many dups on index build with dropDups=true".

How can I add the index and get rid of the duplicates? Or, the other way around, what's the best way to delete enough dups that Mongo can successfully build the index?

For bonus points, why is there a limit to the number of dups that can be dropped?


Answer 1:


"For bonus points, why is there a limit to the number of dups that can be dropped?"

MongoDB is likely doing this to protect itself. If you run dropDups on the wrong field, you could wipe out most of the dataset and tie up the DB with delete operations (which are as expensive as writes).

"How can I add the index and get rid of the duplicates?"

So the first question is why are you creating a unique index on the id field?

MongoDB creates a default _id field that is automatically unique and indexed. By default MongoDB populates _id with an ObjectId, but you can override this with whatever value you like. So if you have a ready set of ID values, you can use those when you import the data.

If you cannot re-import the values, then copy them to a new collection, changing id into _id as you go. You can then drop the old collection and rename the new one. Note that you will get a bunch of "duplicate key" errors during the copy; make sure your code catches and ignores them.
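The copy described above could be sketched roughly as follows. This is only an illustration, not the answerer's own code: the function name copyUsingIdAsKey and the collection names in the usage note are hypothetical, it assumes every document has an id field, and it assumes a shell in which insertOne throws an error carrying code 11000 on a duplicate key (as current mongosh versions do).

```javascript
// Copy every document from `source` into `target`, using the old `id`
// value as the new `_id`. Because `_id` is always unique, a duplicate
// `id` makes the insert fail, and that failure is caught and counted.
function copyUsingIdAsKey(source, target) {
  var copied = 0;
  var skipped = 0;
  source.find().forEach(function (doc) {
    var copy = Object.assign({}, doc);
    copy._id = copy.id;   // assumption: every document has an `id` field
    delete copy.id;
    try {
      target.insertOne(copy);
      copied++;
    } catch (e) {
      // 11000 is MongoDB's duplicate key error code; rethrow anything else.
      if (e.code !== 11000) {
        throw e;
      }
      skipped++;
    }
  });
  return { copied: copied, skipped: skipped };
}
```

In the shell this might be called as copyUsingIdAsKey(db.items, db.items_dedup), followed by db.items.drop() and db.items_dedup.renameCollection("items") once the counts look right. The first document seen for each id is the one that survives.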




Answer 2:


I came across this question while trying to find a workaround for the "too many dups" problem (without re-creating the collection from source). The way I finally did it was to create a new collection c2, add a unique index on the needed field(s) (purely to speed things up), and then do an upsert:

db.c1.find().forEach(function(x){db.c2.update({field1:x.field1, field2:x.field2}, x, {upsert:true})})

where each combination of field1 and field2 should be unique. Then one can just drop the initial collection c1 and rename the new one. This solution, as shown, works for one field or for several.
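For newer shells, a variant of the same idea might look like the sketch below. It is an assumption-laden illustration rather than the answerer's method: the helper name dedupeInto is hypothetical, and it swaps the full-document replacement for updateOne with $setOnInsert, so a later duplicate never overwrites the document already stored (and never tries to change its _id, which newer servers are likely to reject).

```javascript
// Upsert every document from `source` into `target`, keyed on `keyFields`.
// $setOnInsert only takes effect when the upsert creates a new document,
// so the first document seen for each key combination is the one kept.
function dedupeInto(source, target, keyFields) {
  source.find().forEach(function (doc) {
    var filter = {};
    keyFields.forEach(function (field) {
      filter[field] = doc[field];   // assumption: key fields are top-level
    });
    target.updateOne(filter, { $setOnInsert: doc }, { upsert: true });
  });
}
```

With the collections from the answer this would be called as dedupeInto(db.c1, db.c2, ["field1", "field2"]). A unique index on the same fields in c2 keeps each lookup fast, as the answer suggests.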



Source: https://stackoverflow.com/questions/9337123/how-can-i-delete-duplicates-in-mongodb
