Searching across shards?

杀马特。学长 韩版系。学妹 提交于 2019-12-03 08:04:21

There is no magic bullet.

Searching each shard in succession is out of the question, obviously, due to the incredibly high latency you will incur.

So you want to search in parallel, if you have to.

There are two realistic options, and you already listed them -- indexing, and parallelized search. Allow me to go into a little more detail on how you would go about designing them.

The key insight you can use is that in search, you rarely need the complete set of results. You only need the first (or nth) page of results. So there is quite a bit of wiggle room you can use to decrease response time.

Indexing

If you know the attributes on which the users will be searched, you can create custom, separate indexes for them. You can build your own inverted index, which will point to the (shard, recordId) tuple for each search term, or you can store it in the database. Update it lazily, and asynchronously. I do not know your application requirements, it might even be possible to just rebuild the index every night (meaning you will not have the most recent entries on any given day -- but that might be ok for you). Make sure to optimize this index for size so it can fit in memory; note that you can shard this index, if you need to.

Naturally, if people can search for something like "lastname='Smith' OR lastname='Jones'", you can read the index for Smith, read the index for Jones, and compute the union -- you do not need to store all possible queries, just their building parts.

Parallel Search

For every query, send off requests to every shard unless you know which shard to look for because the search happens to be on the distribution key. Make the requests asynchronous. Reply to the user as soon as you get the first page-worth of results; collect the rest and cache locally, so that if the user hits "next" you will have the results ready and do not need to re-query the servers. This way, if some of the servers are taking longer than others, you do not need to wait on them to service the request.

While you are at it, log the response times of the sharded servers to observe potential problems with uneven data and/or load distribution.

I'm assuming you are talking about shards a la : http://highscalability.com/unorthodox-approach-database-design-coming-shard

If you read that article he goes into some detail on exactly your question, but long answer short, you write custom application code to bring your disparate shards together. You can do some smart hashing to both query individual shards and insert data into shards. You need to ask a more specific question to get a more specific answer.

You actually do need every search to hit every shard, or at least every search needs to be performed against an index that contains the data from all shards, which boils down to the same thing.

Presumably you shard based on a single property of the user, probably a hash of the username. If your search feature allows the user to search based on other properties of the user it is clear that there is no single shard or subset of shards that can satisfy a query, because any shard could contain users that match the query. You can't rule out any shards before performing the search, which implies that you must run the query against all shards.

You may want to look at Sphinx (http://www.sphinxsearch.com/articles.html). It supports distributed searching. GigaSpaces has parallel query and merge support. This can also be done with MySQL Proxy (http://jan.kneschke.de/2008/6/2/mysql-proxy-merging-resultsets).

To build a non-sharded indexed kinds of defeats the purpose of the shard in the first place :-) A centralized index probably won't work if shards were necessary.

I think all the shards need to be hit in parallel. The results need to be filtered, ranked, sorted, grouped and the results merged from all the shards. If the shards themselves become overwhelmed you have to do the usual (reshard, scale up, etc) to underwhelm them again.

RDBMs are not good tool for textual search. You will be much better off looking at Solr. Performance difference between Solr and database will be in the order of magnitude of 100X.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!