MongoDB Index optimization when using text-search in the aggregation framework

Posted by 让人想犯罪 __ on 2019-12-06 05:14:58

Question


We are building a simplified version of a search engine on top of MongoDB.

Sample data set

{ "_id" : 1, "dept" : "tech", "updDate":  ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 2, "dept" : "tech", "updDate":  ISODate("2014-07-27T09:45:35Z"), "description" : "wireless red mouse" }
{ "_id" : 3, "dept" : "kitchen", "updDate":  ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat" }
{ "_id" : 4, "dept" : "kitchen", "updDate":  ISODate("2014-05-27T09:45:35Z"), "description" : "red peeler" }
{ "_id" : 5, "dept" : "food", "updDate":  ISODate("2014-04-27T09:45:35Z"), "description" : "green apple" }
{ "_id" : 6, "dept" : "food", "updDate":  ISODate("2014-01-27T09:45:35Z"), "description" : "red potato" }
{ "_id" : 7, "dept" : "food", "updDate":  ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 8, "dept" : "food", "updDate":  ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 9, "dept" : "food", "updDate":  ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }

We want to avoid "offset-limit" pagination. Instead we use the "seek method": we modify the where/match clause of the query so that an index can locate the next page directly, rather than iterating over the collection to reach the desired results. For more information about the seek method, I highly recommend reading http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way
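To make the seek method concrete outside of MongoDB, here is a minimal in-memory sketch in plain JavaScript. The helper name `seekPage` and the reduced rows (only `_id` and `updDate`, with dates kept as ISO strings) are assumptions made for illustration. It only shows the paging logic, that is, remembering the last row seen instead of counting skipped rows; it says nothing about index usage.

```javascript
// Hypothetical helper: returns the next page of rows ordered by
// updDate descending, then _id descending, starting strictly after `last`.
function seekPage(rows, last, pageSize) {
  // Descending comparator on the (updDate, _id) tuple.
  // ISO-8601 strings in the same format compare correctly as strings.
  const byKeyDesc = (a, b) =>
    a.updDate < b.updDate ? 1 :
    a.updDate > b.updDate ? -1 :
    b._id - a._id;

  const sorted = [...rows].sort(byKeyDesc);
  // Keep only rows that sort strictly after the last row already shown.
  const remaining = last ? sorted.filter((r) => byKeyDesc(last, r) < 0) : sorted;
  return remaining.slice(0, pageSize);
}

const rows = [
  { _id: 3, updDate: "2014-04-27T09:45:35Z" },
  { _id: 5, updDate: "2014-04-27T09:45:35Z" },
  { _id: 7, updDate: "2014-08-28T09:45:35Z" },
  { _id: 8, updDate: "2014-08-27T09:45:35Z" },
];

// First page has no anchor; the second page is anchored on the last row of the first.
const page1 = seekPage(rows, null, 2);
const page2 = seekPage(rows, page1[page1.length - 1], 2);
```

In a database the same "strictly after the last row" condition becomes a range predicate, which is what lets an index jump straight to the next page.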

Search engines typically order results by score and update date in descending order. To achieve that, we use the text search function in the aggregation pipeline as follows.

db.inventory.createIndex({ description: "text", dept: -1, updDate: -1, _id: -1 })

First page

db.inventory.aggregate([
  { $match: { dept: { $in: ["food", "kitchen"] }, $text: { $language: "en", $search: "green" } } },
  { $project: { score: { $meta: "textScore" }, description: 1, updDate: 1, _id: 1 } },
  { $sort: { score: -1, updDate: -1, _id: -1 } },
  { $limit: 2 }
])


{ "_id" : 5, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green apple", "score" : 0.75 }
{ "_id" : 3, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat", "score" : 0.75 }

Second page

db.inventory.aggregate([
  { $match: { dept: { $in: ["food", "kitchen"] }, $text: { $language: "en", $search: "green" } } },
  { $project: { score: { $meta: "textScore" }, description: 1, updDate: 1, _id: 1 } },
  { $sort: { score: -1, updDate: -1, _id: -1 } },
  { $match: { $or: [
      { score: { $lt: 0.75 } },
      { $and: [
          { score: { $eq: 0.75 } },
          { $or: [
              { updDate: { $lt: ISODate("2014-04-27T09:45:35Z") } },
              { $and: [
                  { updDate: { $eq: ISODate("2014-04-27T09:45:35Z") } },
                  { _id: { $lt: 3 } }
              ] }
          ] }
      ] }
  ] } },
  { $limit: 2 }
])

{ "_id" : 7, "updDate" : ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
{ "_id" : 9, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }

And the last page

db.inventory.aggregate([
  { $match: { dept: { $in: ["food", "kitchen"] }, $text: { $language: "en", $search: "green" } } },
  { $project: { score: { $meta: "textScore" }, description: 1, updDate: 1, _id: 1 } },
  { $sort: { score: -1, updDate: -1, _id: -1 } },
  { $match: { $or: [
      { score: { $lt: 0.6666666666666666 } },
      { $and: [
          { score: { $eq: 0.6666666666666666 } },
          { $or: [
              { updDate: { $lt: ISODate("2014-08-27T09:45:35Z") } },
              { $and: [
                  { updDate: { $eq: ISODate("2014-08-27T09:45:35Z") } },
                  { _id: { $lt: 9 } }
              ] }
          ] }
      ] }
  ] } },
  { $limit: 2 }
])


{ "_id" : 8, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }

Notice that we order the results by score, updDate and _id, and that the second match phase paginates through them using the score, update date and finally the _id of the last document on the previous page.
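The second match phase follows a fixed pattern, so it can be generated from the last document of the previous page. The following is a minimal sketch in plain JavaScript; the helper name `buildSeekMatch` is hypothetical, and it assumes the sort order used above (score, updDate, _id, all descending).

```javascript
// Hypothetical helper: builds the pagination $match stage from the last
// document of the previous page. Assumed sort order: score, updDate, _id,
// all descending.
function buildSeekMatch(last) {
  return {
    $match: {
      $or: [
        // Anything with a strictly lower score comes later.
        { score: { $lt: last.score } },
        {
          $and: [
            { score: { $eq: last.score } },
            {
              $or: [
                // Same score: an older update date comes later.
                { updDate: { $lt: last.updDate } },
                {
                  // Same score and date: fall back to the _id tie-breaker.
                  $and: [
                    { updDate: { $eq: last.updDate } },
                    { _id: { $lt: last._id } },
                  ],
                },
              ],
            },
          ],
        },
      ],
    },
  };
}

// Last document of the first page in the example above.
const stage = buildSeekMatch({
  _id: 3,
  updDate: new Date("2014-04-27T09:45:35Z"),
  score: 0.75,
});
console.log(JSON.stringify(stage, null, 2));
```

The equality test on score only works if the client sends back exactly the floating-point value it received for the last document, so the score should be passed through unmodified rather than rounded for display.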

The index definition takes into account that text queries can't be covered on text index prefix fields (see https://jira.mongodb.org/browse/SERVER-13018), although I am not sure that applies to our case.

As the "executionStats" and "allPlansExecution" modes do not work in the aggregation framework (see https://jira.mongodb.org/browse/SERVER-19758), I have no clue how MongoDB resolves the query.
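Even without those modes, `aggregate` accepts an `explain: true` option that asks the server to describe how it would process the pipeline. The sketch below assumes a server version that supports this option (it was introduced in 2.6); the wrapper name `explainFirstPage` is hypothetical, and `db` is passed in explicitly so the function does not depend on a shell global. How much detail the output contains, and whether it names the index chosen for the text stage, depends on the server version.

```javascript
// Hypothetical wrapper around the first-page pipeline shown above.
// `db` is the database handle (the `db` global in the mongo shell).
function explainFirstPage(db) {
  const pipeline = [
    { $match: { dept: { $in: ["food", "kitchen"] }, $text: { $language: "en", $search: "green" } } },
    { $project: { score: { $meta: "textScore" }, description: 1, updDate: 1, _id: 1 } },
    { $sort: { score: -1, updDate: -1, _id: -1 } },
    { $limit: 2 },
  ];
  // explain: true returns a description of the pipeline processing
  // instead of the matching documents.
  return db.inventory.aggregate(pipeline, { explain: true });
}
```

In the mongo shell this could be called as `printjson(explainFirstPage(db))`.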

Index intersection does not work for text search either; see https://jira.mongodb.org/browse/SERVER-3071 (resolved in 2.5.5) and http://blog.mongodb.org/post/87790974798/efficient-indexing-in-mongodb-26, where the author says:

As of version 2.6.0, you cannot intersect with geo or text indices and you can intersect at most 2 separate indices with each other. These limitations are likely to change in a future release.

I have read sections 3.4 (Text Search Tutorials) and 3.5 (Indexing Strategies) of https://docs.mongodb.org/manual/MongoDB-indexes-guide-master.pdf several times without reaching any clear conclusion.

So what is the best index strategy for indexing this collection from the text-search perspective?

One index for the first match phase and another one for the second (pagination) match phase?

db.inventory.createIndex({ description: "text", dept: -1 })
db.inventory.createIndex({ updDate: -1, _id: -1 })

A compound index taking into account fields from both match phases?

db.inventory.createIndex({ description: "text", dept: -1, updDate: -1, _id: -1 })

None of the above?

Thanks

Source: https://stackoverflow.com/questions/33436263/mongodb-index-optimization-when-using-text-search-in-the-aggregation-framework
