How to remove duplicates based on a key in MongoDB?

伪装坚强ぢ 2020-11-30 20:56

I have a collection in MongoDB with around 3 million records. A sample record looks like:

 { "_id" = ObjectId("50731xxxxxxxxxxxxxxxxxx

8 Answers
  •  [愿得一人]
     2020-11-30 21:20

    I had a similar requirement, but I wanted to retain the latest entry. The following query worked with my collections, which held millions of records including duplicates.

    /** Create an array to store the ids of all duplicate records */
    var duplicates = [];
    
    /** Start Aggregation pipeline*/
    db.collection.aggregate([
      {
        $match: { /** Add any filter here. Add index for filter keys*/
          filterKey: {
            $exists: false
          }
        }
      },
      {
        $sort: { /** Sort so that the document you want to retain comes first (newest first here) */
          createdAt: -1
        }
      },
      {
        $group: {
          _id: {
            key1: "$key1", key2: "$key2" /** These keys define a duplicate: documents with the same values for key1 and key2 are considered duplicates */
          },
          dups: {
            $push: {
              _id: "$_id"
            }
          },
          count: {
            $sum: 1
          }
        }
      },
      {
        $match: {
          count: {
            "$gt": 1
          }
        }
      }
    ],
    {
      allowDiskUse: true
    }).forEach(function(doc){
      doc.dups.shift();
      doc.dups.forEach(function(dupId){
        duplicates.push(dupId._id);
      })
    })
    
    /** Delete the duplicates in chunks, so each $in list stays a manageable size */
    var i, j, temparray, chunk = 100000;
    for (i = 0, j = duplicates.length; i < j; i += chunk) {
      temparray = duplicates.slice(i, i + chunk);
      db.collection.deleteMany({ _id: { $in: temparray } });
    }
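
    For intuition, the group-and-keep-latest step can be sketched in plain JavaScript. This is only an illustration of what the aggregation pipeline computes inside the database, not a replacement for it; the field names key1, key2, and createdAt simply mirror the example above.

    ```javascript
    // Group documents by a compound key, keep the newest document per
    // group (by createdAt), and collect the _ids of every other
    // (duplicate) document -- the same result the pipeline produces.
    function findDuplicateIds(docs) {
      const groups = new Map();
      for (const doc of docs) {
        const key = JSON.stringify([doc.key1, doc.key2]); // like $group _id
        if (!groups.has(key)) groups.set(key, []);
        groups.get(key).push(doc);
      }

      const duplicates = [];
      for (const group of groups.values()) {
        if (group.length < 2) continue;                   // like count > 1
        group.sort((a, b) => b.createdAt - a.createdAt);  // like $sort: -1
        group.shift();                                    // retain the newest
        for (const doc of group) duplicates.push(doc._id);
      }
      return duplicates;
    }

    const docs = [
      { _id: 1, key1: "a", key2: "x", createdAt: 10 },
      { _id: 2, key1: "a", key2: "x", createdAt: 20 }, // newest a/x: kept
      { _id: 3, key1: "b", key2: "y", createdAt: 5 },  // unique: kept
    ];
    console.log(findDuplicateIds(docs)); // [ 1 ]
    ```

    Grouping in application code like this only works for data sets that fit in memory; for millions of records, let the server do the grouping with the pipeline above (note the allowDiskUse: true option, which lets $group spill to disk).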
