Suppose I have a collection containing a set of documents, something like this:

{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id" : 1, "name" : "foo" }
For a generic MongoDB solution, see the MongoDB cookbook recipe for finding duplicates using group. Note that the aggregation framework is faster and more powerful, in that it can return the _ids of the duplicate records.
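As a sketch of that aggregation approach (assuming MongoDB 2.2+ and the same test.prb collection used in the group example below; the stage syntax is standard, but the collection name is illustrative):

$con = new Mongo('mongodb://localhost:27017');
$collection = $con->test->prb;

$pipeline = array(
    // group documents by name, collecting the _id of every member
    array('$group' => array(
        '_id'   => '$name',
        'dups'  => array('$push' => '$_id'),
        'count' => array('$sum' => 1),
    )),
    // keep only the groups with more than one document, i.e. the duplicates
    array('$match' => array('count' => array('$gt' => 1))),
);

$result = $collection->aggregate($pipeline);
print_r($result['result']); // each entry lists a duplicated name and its _ids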
The accepted answer (using mapReduce) is not that efficient. Instead, with the PHP driver, we can use the group method:
$connection = 'mongodb://localhost:27017';
$con = new Mongo($connection); // mongo db connection
$db = $con->test; // database
$collection = $db->prb; // table
$keys = array("name" => 1); // select the name field and group by it
// initial values for each group's accumulator
$initial = array("count" => 0);
// JavaScript reduce function, run once per grouped document
$reduce = "function (obj, prev) { prev.count++; }";
$g = $collection->group($keys, $initial, $reduce);
echo "";
print_r($g);
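For reference, the output below corresponds to a collection seeded with three documents, one with no name field and two sharing the name "MongoDB" (illustrative data, inferred from the result shown):

$collection->insert(array("id" => 1));                      // no name field
$collection->insert(array("id" => 2, "name" => "MongoDB"));
$collection->insert(array("id" => 3, "name" => "MongoDB")); // duplicate name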
The output will be:
Array
(
    [retval] => Array
        (
            [0] => Array
                (
                    [name] =>
                    [count] => 1
                )
            [1] => Array
                (
                    [name] => MongoDB
                    [count] => 2
                )
        )
    [count] => 3
    [keys] => 2
    [ok] => 1
)
The equivalent SQL query would be: SELECT name, COUNT(name) FROM prb GROUP BY name. Note that to find the duplicates we still need to filter the retval array down to the entries whose count is greater than 1. Again, refer to the MongoDB cookbook recipe for finding duplicates using group for the canonical solution.
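A minimal sketch of that filtering step (assuming the $g result returned by the group call above, and PHP 5.3+ for the closure):

// keep only the names that occur more than once
$duplicates = array_filter($g['retval'], function ($row) {
    return $row['count'] > 1;
});
print_r(array_values($duplicates)); // reindex for clean output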