Question
I have a test MongoDB instance (version 3.0.1) running on Amazon EC2 (3.14.33-26.47.amzn1.x86_64, t2.medium: 2 vCPUs, 4 GB RAM).
It has a collection "access_log" (about 40,000,000 records, roughly 1,000,000 added each day) with several indexes on it:
...
db.access_log.ensureIndex({ visit_dt: 1, 'username': 1 })
db.access_log.ensureIndex({ visit_dt: 1, 'file': 1 })
...
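For reference, the full set of indexes can be listed directly in the shell (just a quick check of the definitions above):

// List every index on the collection, including the two compound indexes above.
db.access_log.getIndexes()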
The following "aggregate" is extremely slow (it takes several hours):
db.access_log.aggregate([
{ "$match": { "visit_dt": { "$gte": ISODate('2015-03-09'), "$lt": ISODate('2015-03-11') } } },
{ "$project": { "file": 1, "_id": 0 } },
{ "$group": { "_id": "$file", "count": { "$sum": 1 } } },
{ "$sort": { "count": -1 } }
])
All fields needed for this aggregation are included in the second index ({ visit_dt: 1, 'file': 1 }, i.e. "visit_dt_1_file_1").
So I am very confused about why MongoDB uses the other index instead of this one.
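For comparison, here is a sketch of the equivalent plain find() that I would expect to be answered entirely from "visit_dt_1_file_1" (the hint() is only there to force that index):

// Same predicate and projection as the aggregation's first two stages;
// with _id excluded and only indexed fields projected, this query can be
// covered by the visit_dt_1_file_1 index (no FETCH needed).
db.access_log.find(
    { "visit_dt": { "$gte": ISODate('2015-03-09'), "$lt": ISODate('2015-03-11') } },
    { "file": 1, "_id": 0 }
).hint({ "visit_dt": 1, "file": 1 }).explain()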
When I explain the plan, I always get the following output, which I do not understand at all.
Could you please help? Thanks a lot!
> db.access_log.aggregate([
... { "$match": { "visit_dt": { "$gte": ISODate('2015-03-09'), "$lt": ISODate('2015-03-11') } } },
... { "$project": { "file": 1, "_id": 0 } },
... { "$group": { "_id": "$file", "count": { "$sum": 1 } } },
... { "$sort": { "count": -1 } }
... ], { explain: true } );
{
"stages" : [
{
"$cursor" : {
"query" : {
"visit_dt" : {
"$gte" : ISODate("2015-03-09T00:00:00Z"),
"$lt" : ISODate("2015-03-11T00:00:00Z")
}
},
"fields" : {
"file" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "xxxx.access_log",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"visit_dt" : {
"$lt" : ISODate("2015-03-11T00:00:00Z")
}
},
{
"visit_dt" : {
"$gte" : ISODate("2015-03-09T00:00:00Z")
}
}
]
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"visit_dt" : 1,
"username" : 1
},
"indexName" : "visit_dt_1_username_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"visit_dt" : [
"[new Date(1425859200000), new Date(1426032000000))"
],
"username" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [
...
{
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"visit_dt" : 1,
"file" : 1
},
"indexName" : "visit_dt_1_file_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"visit_dt" : [
"[new Date(1425859200000), new Date(1426032000000))"
],
"file" : [
"[MinKey, MaxKey]"
]
}
}
},
...
]
}
}
},
{
"$project" : {
"_id" : false,
"file" : true
}
},
{
"$group" : {
"_id" : "$file",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
},
{
"$sort" : {
"sortKey" : {
"count" : -1
}
}
}
],
"ok" : 1
}
Answer 1:
You might want to read the docs regarding $sort performance:
The $sort operator can take advantage of an index when placed at the beginning of the pipeline or before the $project, $unwind, and $group aggregation operators. If $project, $unwind, or $group occur prior to the $sort operation, $sort cannot use any indexes.
Also, keep in mind that it is called an 'aggregation pipeline' for a reason: it simply doesn't matter where you sort after the matching stage. So the solution should be pretty simple:
db.access_log.aggregate([
{
"$match": {
"visit_dt": {
"$gte": ISODate('2015-03-09'),
"$lt": ISODate('2015-03-11')
},
"file": {"$exists": true }
}
},
{ "$sort": { "file": 1 } },
{ "$project": { "file": 1, "_id": 0 } },
{ "$group": { "_id": "$file", "count": { "$sum": 1 } } },
{ "$sort": { "count": -1 } }
])
The check whether the file field exists might be unnecessary if the field is guaranteed to exist in every record. It does not hurt, though, as the field is part of an index. The same goes for the additional sort: since we made sure that only documents containing a file field enter the pipeline, the index should be used.
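To double-check which index the rewritten pipeline actually picks, you can explain it and pull the index name out of the winning plan. A minimal sketch using the command form (so the result is a plain document, with the same structure as the explain output in the question):

// Explain the rewritten pipeline and print the index chosen by the winning plan.
var res = db.runCommand({
    aggregate: "access_log",
    pipeline: [
        { "$match": {
            "visit_dt": { "$gte": ISODate('2015-03-09'), "$lt": ISODate('2015-03-11') },
            "file": { "$exists": true }
        } },
        { "$sort": { "file": 1 } },
        { "$project": { "file": 1, "_id": 0 } },
        { "$group": { "_id": "$file", "count": { "$sum": 1 } } },
        { "$sort": { "count": -1 } }
    ],
    explain: true
});

// With a FETCH-over-IXSCAN winning plan this prints the index name,
// e.g. "visit_dt_1_file_1".
print(res.stages[0]["$cursor"].queryPlanner.winningPlan.inputStage.indexName);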
Answer 2:
Thanks to @Markus W Mahlberg.
I changed the query as follows:
db.access_log.aggregate([
{
"$match": {
"visit_dt": {
"$gte": ISODate('2015-03-09'),
"$lt": ISODate('2015-03-11')
},
}
},
{ "$sort": { "visit_dt": 1, "file": 1 } },
{ "$project": { "file": 1, "_id": 0 } },
{ "$group": { "_id": "$file", "count": { "$sum": 1 } } },
{ "$sort": { "count": -1 } }
], { explain: true })
Then I got the correct execution plan:
...
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"visit_dt" : 1,
"file" : 1
},
"indexName" : "visit_dt_1_file_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"visit_dt" : [
"[new Date(1425859200000), new Date(1426032000000))"
],
"file" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [ ]
...
Although it is still somewhat slow, I think that is just down to my CPU, memory, and disks.
Thank you so much!
Answer 3:
You can find some information here:
If the query planner selects an index, the explain result includes an IXSCAN stage. The stage includes information such as the index key pattern, direction of traversal, and index bounds.
Your explain() output indicates that an IXSCAN occurred, so it appears your indexes are working as expected.
Try running the aggregate command without the sorting or grouping stages; you will most likely see much better results. If so, you can narrow the problem down to one of those operations.
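For example, a stripped-down pipeline that keeps only the $match and $project stages from the question (iterated with itcount() so the cursor is fully consumed) would isolate the cost of the index scan and fetch:

// Same $match and $project as the original pipeline, without $group and $sort;
// itcount() forces the shell to iterate the whole result so the timing is realistic.
db.access_log.aggregate([
    { "$match": { "visit_dt": { "$gte": ISODate('2015-03-09'), "$lt": ISODate('2015-03-11') } } },
    { "$project": { "file": 1, "_id": 0 } }
]).itcount()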
If that is not the case, you should also monitor system memory while running this query. What is most likely happening is that MongoDB cannot keep an index over 40,000,000 records in memory, so it is paging index data in from disk (which is very slow) while running the query.
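As a quick sanity check of that theory, you can compare the total index size against the memory available on the instance (a sketch; stats() takes a scale factor, here MB):

// Total size of all indexes on the collection, in MB.
db.access_log.stats(1024 * 1024).totalIndexSize

// Resident/virtual memory used by mongod, in MB, for comparison with the 4 GB instance.
db.serverStatus().mem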
Source: https://stackoverflow.com/questions/29294504/mongodb-seems-to-choose-the-wrong-index-when-doing-aggregate