Question
I have a MongoDB aggregation pipeline that contains a number of steps (match on indexed fields, add fields, sort, collapse, sort again, page, project results). If I comment out all of the steps except the first $match step, the query executes very fast (0.075 seconds), as it's leveraging the proper index. However, if I then try to perform ANY follow-up step, even something as simple as getting the result count, the query starts taking 27 seconds!!!
Here is the query: (Don't get too caught up in the complexity of it, as the indexes are doing their job in executing it quickly...)
db.runCommand({
aggregate: 'ResidentialProperty',
allowDiskUse: false,
explain: false,
cursor: {},
pipeline:
[
{
"$match" : {
"$and" : [
{
"CountyPlaceId" : 20006073
},
{
"$or" : [
{
"$and" : [
{
"ForSaleGroupId" : {
"$in" : [
2,
3
]
}
},
{
"$or" : [
{
"ForSaleGroupId" : {
"$nin" : [
2,
3
]
}
},
{
"ListDate" : {
"$gte" : ISODate("2019-02-21T00:00:00.000Z")
}
}
]
},
{
"$or" : [
{
"ForSaleGroupId" : {
"$ne" : 3
}
},
{
"PendingSaleDate" : {
"$gte" : ISODate("2019-02-21T00:00:00.000Z")
}
}
]
}
]
},
{
"ForLeaseGroupId" : {
"$in" : [
2,
3
]
},
"$or" : [
{
"ForLeaseGroupId" : {
"$nin" : [
2,
3
]
}
},
{
"ListDate" : {
"$gte" : ISODate("2019-02-21T00:00:00.000Z")
}
}
]
},
{
"DistressedGroupId" : {
"$in" : [
2,
3,
4
]
},
"$or" : [
{
"DistressedGroupId" : 1
},
{
"DistressedDate" : {
"$gte" : ISODate("2019-02-21T00:00:00.000Z")
}
}
]
},
{
"$and" : [
{
"OffMarketGroupId" : {
"$in" : [
3,
8
]
}
},
{
"$or" : [
{
"OffMarketGroupId" : 1
},
{
"OffMarketDate" : {
"$gte" : ISODate("2019-02-21T00:00:00.000Z")
}
}
]
},
{
"$or" : [
{
"OffMarketGroupId" : {
"$nin" : [
7,
8
]
}
},
{
"SoldDate" : {
"$gte" : ISODate("2019-02-21T00:00:00.000Z")
}
},
{
"OffMarketDate" : {
"$gte" : ISODate("2019-02-21T00:00:00.000Z")
}
}
]
}
]
},
{
"$or" : [
{
"ForSaleGroupId" : {
"$ne" : 1
}
},
{
"OffMarketGroupId" : 6
}
],
"ChangedListPriceDate" : {
"$gte" : ISODate("2019-02-21T00:00:00.000Z")
}
}
]
},
{
"$or" : [
{
"ForSaleGroupId" : {
"$ne" : 1
}
},
{
"ForLeaseGroupId" : {
"$ne" : 1
}
},
{
"OffMarketGroupId" : 6
},
{
"IsListingOnly" : true
},
{
"OrgId" : ""
},
{
"OffMarketDate" : {
"$gte" : ISODate("2018-11-23T00:00:00.000Z")
}
}
]
},
{
"PropertyTypeId" : {
"$in" : [
1,
5,
6
]
}
}
]
}
},
// Other steps omitted, since it's slow regardless...
{ "$count": "Count" }
]
})
Here is what a sample ResidentialProperty document looks like:
{
"_id" : 294401911,
"PropertyId" : 86689647,
"OrgId" : "caclaw-n",
"OrgSecurableId" : 1,
"ListingId" : "19443870",
"Location" : {
"type" : "Point",
"coordinates" : [
-117.316207,
33.104623
]
},
"CountyPlaceId" : 20006073,
"CityPlaceId" : 50611194,
"ZipCodePlaceId" : 70092011,
"MetropolitanAreaPlaceId" : 10041740,
"MinorCivilDivisionPlaceId" : 30002074,
"NeighborhoodPlaceId" : 150813707,
"MacroNeighborhoodPlaceId" : 160051666,
"SubNeighborhoodPlaceId" : null,
"ResidentialNeighborhoodsPlaceId" : 220978234,
"ForSaleGroupId" : 1,
"DistressedGroupId" : 1,
"OffMarketGroupId" : 1,
"ForLeaseGroupId" : 2,
"ForSaleDistressedGroupId" : 1,
"OffMarketDistressedGroupId" : 1,
"ListDate" : ISODate("2019-03-15T00:00:00.000Z"),
"PendingSaleDate" : null,
"OffMarketDate" : null,
"DistressedDate" : null,
"SoldDate" : null,
"ChangedListPriceDate" : null,
"ListPrice" : null,
"ListPriceRangeLow" : null,
"ListPriceRangeHigh" : null,
"ListPricePerSqFt" : null,
"ListPricePerLotSizeSqFt" : null,
"SoldPrice" : 0,
"SoldPricePerSqFt" : 0.0,
"SoldPricePerLotSizeSqFt" : 0.0,
"MonthlyLeaseListPrice" : 6950.0,
"MonthlyLeaseListPricePerSqFt" : 2.5402,
"MonthlyLeaseListPricePerLotSizeSqFt" : 2.5402,
"MonthlyLeaseSoldPrice" : null,
"MonthlyLeaseSoldPricePerSqFt" : null,
"MonthlyLeaseSoldPricePerLotSizeSqFt" : null,
"SoldToListPriceRatio" : 0.0,
"EstimatedToListPriceRatio" : 0.0,
"AppPropertyModeId" : 1,
"PropertyTypeId" : 1,
"PropertySubTypeId" : null,
"Bedrooms" : 4,
"Bathrooms" : 3,
"LivingAreaInSqFt" : 2736,
"LotSizeInSqFt" : NumberLong(5073),
"YearBuilt" : 2004,
"GarageSpaces" : 2,
"BuildingSizeInSqFt" : 2736,
"Units" : 1,
"Rooms" : null,
"NetIncome" : null,
"EstimateTypeId" : 3,
"EstimatedValue" : 1253740,
"EstimatedValuePerSqFt" : 458.2383,
"EstimatedValuePerLotSizeSqFt" : 247.1397,
"CapRate" : null,
"Keywords" : [
"$6,950/month long-term minimum of 30 days. $8,950 June and then $9,950 for July or August. BeautifulWaters End Luxury Home walking distance to the beach. Short or Long term Fully Furnished (1 Month plus) with brand new furnishings & fresh paint & new carpets. Enjoy the beach & golf community lifestyle of Carlsbad, CA in this delightful North County San Diego vacation rental home! This spacious & comfortable two story single family home sits on a cul-de-sac in the gated community of Waters End. Easy walk to the beach and close proximity to the Carlsbad train station, area restaurants, shopping, golf courses, and San Diego theme park attractions. The community also offers many health and beauty spas, yoga, and meditation centers, nearby world-renowned golf courses (such as Torrey Pines, Aviara, and La Costa Resort and Spa) as well as some of the best cycling in all of San Diego County.",
"San Diego (City) (Sd)",
"R1",
"Single Family"
],
"OwnerName" : "Brookside Land Trust, ; State Trustee Services Llc",
"TenantNames" : null,
"Apn" : "214-610-49-00",
"OpenHouseStartDate" : null,
"OpenHouseEndDate" : null,
"ListingPhotoCount" : 25,
"StatusChangedDate" : ISODate("2019-06-28T00:00:00.000Z"),
"SortAddress" : "BrooksideCtZZZZZZZZZZ00000000000000000617ZZZZZCarlsbadCA92011",
"SortOwnerName" : "BrooksideLandTrust,;State",
"ListingIdAlphaNum" : "19443870",
"IsListingOnly" : false
}
The count returns 27,815 results. I don't see this as being an indexing issue, since the first $match step executes so fast. I also don't see this as being an issue of hitting the 100 MB per-stage memory limit for aggregation pipelines, since I'm setting allowDiskUse: false and yet it's still executing the query without erroring.
Also of interest, another aggregation pipeline query against the same collection filters down to 45,081 records after the first match step, and yet when I execute a count after that it returns in only 3 seconds. So the document structure can't really be blamed for this issue.
So what the heck is going on here? Why is the match filtering so fast and yet any operation after, even something as simple as a count, is so incredibly slow? I've tried enabling explain: true and I don't see anything that stands out there. The match operation shows that it's using the proper index. The count operation doesn't include any additional details in the explain.
Answer 1:
2019 ANSWER
This answer is for MongoDB 4.2
After reading the question and the discussion above, I believe that the issue itself is resolved, but optimization remains a common problem for everyone using MongoDB.
I faced the same problem, and here are the tips for query optimization.
Correct me if I'm wrong :)
1. Add indexes to the collection
Indexes play a vital role in running queries quickly, as indexes are data structures that store a portion of the collection's data set in a form that is easy to traverse. MongoDB uses them to execute queries efficiently.
You can create different types of indexes according to your needs. Learn more about indexes in the official MongoDB documentation.
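For instance, a compound index covering the equality fields of the question's first $match might look like the sketch below. The field choice and order here are only an assumption for illustration; the right index depends on your workload (a common guideline is equality fields first, then sort fields, then range fields):
db.ResidentialProperty.createIndex({ CountyPlaceId: 1, PropertyTypeId: 1, ListDate: 1 })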
2. Pipeline optimization
- Always use $match before $project, since filtering early removes extra documents and fields before the next stage.
- Remember that indexes are used by $match and $sort, so try to add an index on the fields you filter or sort documents by.
- Try to keep the sequence $sort + $limit + $skip in your query. A $sort followed directly by $limit lets MongoDB keep only the top results in memory and select an index-backed query plan while executing the query.
- Use $limit before $skip so the limit caps how many documents flow into the skip (the limit then needs to cover the skipped documents as well).
- Use $project to return only the data the next stage needs. A sketch of this stage ordering follows this list.
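Here is a minimal sketch of that ordering, reusing field names from the question; the sort field, limit/skip values, and projected fields are placeholders:
db.ResidentialProperty.aggregate([
    { $match: { CountyPlaceId: 20006073 } },    // filter first, on an indexed field
    { $sort: { ListDate: -1 } },                // sort next, ideally on an indexed field
    { $limit: 150 },                            // limit covers skip + page size (100 + 50)
    { $skip: 100 },                             // then skip to the requested page
    { $project: { _id: 1, ListPrice: 1 } }      // keep only the fields the caller needs
])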
Always create an index on the foreignField attribute of a $lookup. Also, since a $lookup produces an array that you usually $unwind in the next stage anyway, place the $unwind immediately after the $lookup; the optimizer then coalesces the two stages so the full array is never materialized:
{ $lookup: { from: "Collection", localField: "x", foreignField: "y", as: "resultingArray" } },
{ $unwind: { path: "$resultingArray", preserveNullAndEmptyArrays: false } }
Use allowDiskUse in large aggregations; with it, aggregation stages can write data to the _tmp subdirectory of the dbPath directory instead of failing when they exceed the in-memory limit. For example:
db.orders.aggregate(
    [
        { $match: { status: "A" } },
        { $group: { _id: "$uid", total: { $sum: 1 } } },
        { $sort: { total: -1 } }
    ],
    { allowDiskUse: true }
)
3. Rebuild the indexes
If you are creating and deleting indexes quite often, rebuild your indexes. This also prompts MongoDB to refresh previously cached query plans; a stale cached plan that keeps being selected over the required one is a painful issue, believe me :(
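If a stale cached plan is the suspect, you can also clear the plan cache for a collection directly from the shell instead of rebuilding everything (a sketch):
db.ResidentialProperty.getPlanCache().clear()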
4. Remove unwanted indexes
Too many indexes slow down Create, Update and Delete operations, since every index has to be maintained alongside each write. So removing the unwanted ones helps a lot.
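To find candidates for removal, the $indexStats stage reports how often each index has been used since the server started (a sketch; the index name below is hypothetical):
db.ResidentialProperty.aggregate([ { $indexStats: {} } ])
// any index whose accesses.ops stays at 0 over a representative workload is a drop candidate:
db.ResidentialProperty.dropIndex("SomeUnusedIndex_1")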
5. Limiting Documents
In a real-world scenario, fetching the complete data set from the database rarely helps. Either you can't display it, or the user can't read all of it anyway. So, instead of fetching everything, fetch the data in chunks (pages), which helps both you and the client consuming that data.
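For example, a simple paging pattern (the filter, sort field, page number, and page size are placeholders):
var pageSize = 50, page = 3;
db.ResidentialProperty.find({ CountyPlaceId: 20006073 })
    .sort({ ListDate: -1 })
    .skip(page * pageSize)
    .limit(pageSize)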
And lastly, watching which execution plan MongoDB selects helps in figuring out the main issue, and the explain output will show you exactly that.
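In the shell, requesting executionStats-level output reports documents examined and time spent per stage, which is usually far more revealing than explain: true on the raw command (a sketch, reusing the question's first filter):
db.ResidentialProperty.explain("executionStats").aggregate([
    { $match: { CountyPlaceId: 20006073 } },
    { $count: "Count" }
])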
Hope this summary will help you guys, feel free to suggest new points if I missed any. I will add them too.
Source: https://stackoverflow.com/questions/57595803/mongodb-aggregate-pipeline-slow-after-first-match-step