Speed up MongoDB aggregation


Question


I have a sharded collection "my_collection" with the following structure:

{ 
   "CREATED_DATE" : ISODate(...),
   "MESSAGE" : "Test Message",
   "LOG_TYPE": "EVENT"
}

The MongoDB environment is sharded with 2 shards. The above collection is sharded using a hashed shard key on LOG_TYPE. There are 7 other possible values for the LOG_TYPE attribute.
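
For context, this is roughly how such a setup is created from a mongos shell; the database name mydb is only a placeholder:

// Enable sharding on the (assumed) database and shard the collection
// with a hashed shard key on LOG_TYPE.
sh.enableSharding("mydb")
sh.shardCollection("mydb.my_collection", { "LOG_TYPE": "hashed" })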

I have 1 million documents in "my_collection" and I am trying to find the count of documents based on the LOG_TYPE using the following query:

db.my_collection.aggregate([
    { "$group" :{ 
        "_id": "$LOG_TYPE",
        "COUNT": { "$sum":1 }
    }}
])

But this takes about 3 seconds to return a result. Is there any way to improve this? Also, when I run the explain command, it shows that no index has been used. Does the $group stage not use an index?


Answer 1:


There are currently some limitations in what the aggregation framework can do to improve the performance of your query, but you can help it in the following way:

db.my_collection.aggregate([
    { "$sort" : { "LOG_TYPE" : 1 } },
    { "$group" :{ 
        "_id": "$LOG_TYPE",
        "COUNT": { "$sum":1 }
    }}
])

By adding a sort on LOG_TYPE you will be "forcing" the optimizer to use an index on LOG_TYPE to get the documents in order. This will improve the performance in several ways, but differently depending on the version being used.
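
If you want to confirm that the index is actually being picked up, one way (a sketch, assuming MongoDB 2.6+ where the aggregate() helper accepts an explain option) is to run the pipeline with explain and check that the cursor stage reports an index scan rather than a BasicCursor/collection scan:

// Sketch: run the same pipeline with explain to inspect the plan.
// The exact output shape differs between versions.
db.my_collection.aggregate([
    { "$sort" : { "LOG_TYPE" : 1 } },
    { "$group" : { "_id": "$LOG_TYPE", "COUNT": { "$sum": 1 } } }
], { explain: true })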

On real data, if the documents coming into the $group stage are already sorted, it will improve the efficiency of accumulating the totals. You can see from the different query plans that with $sort it will use the shard key index. The improvement this gives in actual performance will depend on the number of values in each "bucket" - in general, LOG_TYPE having only seven distinct values makes it an extremely poor shard key, but it does mean that in all likelihood the following code will be a lot faster than even the optimized aggregation:

// For each distinct LOG_TYPE value, run a count() that can use the index;
// with only a handful of distinct values this is a small number of queries.
db.my_collection.distinct("LOG_TYPE").forEach(function(lt) {
   print(db.my_collection.count({"LOG_TYPE": lt}));
});



Answer 2:


There are a limited number of things that you can do in MongoDB here; at the end of the day this might be a physical problem that extends beyond MongoDB itself, for example latency causing the config servers to respond slowly, or results coming back from the shards too slowly.

However, you might be able to solve some performance problems by using a covered query. Since you are in fact sharding on LOG_TYPE you will already have an index on it (one is required before you can shard on a field); not only that, but the aggregation framework automatically adds a projection, so a covered query won't help here.
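
For reference, a covered query is one that can be answered entirely from the index without touching the documents. A minimal sketch of the shape, assuming an ordinary (non-hashed) index on LOG_TYPE; note that a hashed shard-key index stores hashes rather than values, so it may not be able to cover this:

// Hypothetical ascending index; the hashed shard-key index alone may not cover the query.
db.my_collection.ensureIndex({ "LOG_TYPE": 1 })

// Filter and project only the indexed field (and exclude _id) so the query
// can be satisfied from the index alone; explain should report indexOnly: true.
db.my_collection.find({ "LOG_TYPE": "EVENT" }, { "LOG_TYPE": 1, "_id": 0 })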

MongoDB will likely have to communicate with every shard for the results, in what is known as a scatter-gather operation.

$group on its own will not use an index.

These are my results on 2.4.9:

> db.t.ensureIndex({log_type:1})
> db.t.runCommand("aggregate", {pipeline: [{$group:{_id:'$log_type'}}], explain: true})
{
        "serverPipeline" : [
                {
                        "query" : {

                        },
                        "projection" : {
                                "log_type" : 1,
                                "_id" : 0
                        },
                        "cursor" : {
                                "cursor" : "BasicCursor",
                                "isMultiKey" : false,
                                "n" : 1,
                                "nscannedObjects" : 1,
                                "nscanned" : 1,
                                "nscannedObjectsAllPlans" : 1,
                                "nscannedAllPlans" : 1,
                                "scanAndOrder" : false,
                                "indexOnly" : false,
                                "nYields" : 0,
                                "nChunkSkips" : 0,
                                "millis" : 0,
                                "indexBounds" : {

                                },
                                "allPlans" : [
                                        {
                                                "cursor" : "BasicCursor",
                                                "n" : 1,
                                                "nscannedObjects" : 1,
                                                "nscanned" : 1,
                                                "indexBounds" : {

                                                }
                                        }
                                ],
                                "server" : "ubuntu:27017"
                        }
                },
                {
                        "$group" : {
                                "_id" : "$log_type"
                        }
                }
        ],
        "ok" : 1
}

This is the result from 2.6:

> use gthtg
switched to db gthtg
> db.t.insert({log_type:"fdf"})
WriteResult({ "nInserted" : 1 })
> db.t.ensureIndex({log_type: 1})
{ "numIndexesBefore" : 2, "note" : "all indexes already exist", "ok" : 1 }
> db.t.runCommand("aggregate", {pipeline: [{$group:{_id:'$log_type'}}], explain: true})
{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {

                                },
                                "fields" : {
                                        "log_type" : 1,
                                        "_id" : 0
                                },
                                "plan" : {
                                        "cursor" : "BasicCursor",
                                        "isMultiKey" : false,
                                        "scanAndOrder" : false,
                                        "allPlans" : [
                                                {
                                                        "cursor" : "BasicCursor",
                                                        "isMultiKey" : false,
                                                        "scanAndOrder" : false
                                                }
                                        ]
                                }
                        }
                },
                {
                        "$group" : {
                                "_id" : "$log_type"
                        }
                }
        ],
        "ok" : 1
}


Source: https://stackoverflow.com/questions/23428314/speed-up-mongodb-aggregation
