Remove Duplicates from MongoDB

轻奢々 2020-11-28 14:00

Hi, I have about 5 million documents in MongoDB (with replication), and each document has 43 fields. How can I remove the duplicate documents? I tried:

db.testkdd.ensureIndex({
                


        
1 answer
  • 2020-11-28 14:39

    The "dropDups" option for index creation was deprecated in MongoDB 2.6 and removed in MongoDB 3.0. Using it is rarely a good idea, because the removal is arbitrary: any one of the duplicates may be the one that survives, so what gets removed may not be what you actually want removed.

    Anyhow, you are running into an "index length" error because the value of the index key here would be longer than is allowed. Generally speaking, you are not meant to index 43 fields in any normal application.

    If you want to remove the duplicates from a collection, your best bet is to run an aggregation query that groups the documents sharing the same data, and then cycle through that list, removing all but one of the _id values in each group from the target collection. This can be done with "Bulk" operations for maximum efficiency.
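
To see what "all but one" means before touching a real collection, the grouping step can be modelled in plain JavaScript. This is only a sketch that runs without MongoDB: the sample documents and the helper name idsToRemove are made up for illustration, and the real work would still be done by the aggregation pipeline on the server.

```javascript
// In-memory model of the $group / $match / shift() steps; no MongoDB required.
// The sample documents and the helper name idsToRemove are hypothetical.
const docs = [
  { _id: 1, duration: 0, protocol_type: "tcp", service: "http", flag: "SF" },
  { _id: 2, duration: 0, protocol_type: "tcp", service: "http", flag: "SF" },
  { _id: 3, duration: 0, protocol_type: "udp", service: "dns", flag: "SF" },
  { _id: 4, duration: 0, protocol_type: "tcp", service: "http", flag: "SF" },
];

function idsToRemove(documents, keyFields) {
  // Equivalent of $group with "ids": { "$push": "$_id" }
  const groups = new Map();
  for (const doc of documents) {
    const key = JSON.stringify(keyFields.map((f) => doc[f]));
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(doc._id);
  }
  // Equivalent of $match on count > 1, then dropping the first _id of each group
  const removals = [];
  for (const ids of groups.values()) {
    if (ids.length > 1) removals.push(...ids.slice(1));
  }
  return removals;
}

const removals = idsToRemove(docs, ["duration", "protocol_type", "service", "flag"]);
console.log(removals); // [ 2, 4 ]
```

Documents 1, 2 and 4 share the same key, so the first one is kept and the other two are marked for removal; document 3 has no duplicate and is left alone.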

    NOTE: I find it hard to believe that all 43 fields are needed to tell your documents apart. It is likely that all you need is to identify only those fields that make a document "unique" and then follow the process as outlined below:

    var bulk = db.testkdd.initializeOrderedBulkOp(),
        count = 0;
    
    // List "all" fields that make a document "unique" in the `_id`
    // I am only listing some for example purposes to follow
    db.testkdd.aggregate([
        { "$group": {
            "_id": {
                "duration": "$duration",
                "protocol_type": "$protocol_type",
                "service": "$service",
                "flag": "$flag"
            },
            "ids": { "$push": "$_id" },
            "count": { "$sum": 1 }
        }},
        { "$match": { "count": { "$gt": 1 } } }
    ],{ "allowDiskUse": true}).forEach(function(doc) {
        doc.ids.shift();     // remove first match
        bulk.find({ "_id": { "$in": doc.ids } }).remove();  // removes all $in list
        count++;
    
        // Execute once every 1000 queued operations, then re-initialize the batch
        if ( count % 1000 == 0 ) {
           bulk.execute();
           bulk = db.testkdd.initializeOrderedBulkOp();
        }
    });
    
    if ( count % 1000 != 0 ) 
        bulk.execute();
    

    If you have a MongoDB version lower than 2.6 and do not have bulk operations, you can use a standard .remove() inside the loop instead. Note also that in those versions .aggregate() does not return a cursor, so the looping must change to:

    db.testkdd.aggregate([
       // pipeline as above
    ]).result.forEach(function(doc) {
        doc.ids.shift();  
        db.testkdd.remove({ "_id": { "$in": doc.ids } });
    });
    

    But do make sure to look at your documents closely and include only the fields you expect to make a document "unique" in the grouping _id. Otherwise you end up removing nothing at all, since no two documents will share the same grouping key.
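
The effect of picking too many fields can be checked on a small sample first. The following is a rough sketch in plain JavaScript rather than a MongoDB query: the sample data, the choice of "src_bytes" as the distinguishing field, and the helper name countDuplicateGroups are all assumptions made for illustration.

```javascript
// Hypothetical sample data: "src_bytes" stands in for any field that differs
// between otherwise identical records.
const sample = [
  { _id: 1, protocol_type: "tcp", service: "http", src_bytes: 181 },
  { _id: 2, protocol_type: "tcp", service: "http", src_bytes: 239 },
  { _id: 3, protocol_type: "udp", service: "dns", src_bytes: 45 },
];

// Count the groups that contain more than one document for a given field list.
function countDuplicateGroups(documents, keyFields) {
  const counts = new Map();
  for (const doc of documents) {
    const key = JSON.stringify(keyFields.map((f) => doc[f]));
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.values()].filter((n) => n > 1).length;
}

const narrow = countDuplicateGroups(sample, ["protocol_type", "service"]);
const wide = countDuplicateGroups(sample, ["protocol_type", "service", "src_bytes"]);
console.log(narrow, wide); // 1 0
```

With only the two identifying fields there is one duplicate group; once the differing field is added to the key, every document forms its own group and nothing would be removed.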
