MySQL count rows using filters on a high-traffic database

Submitted by 情到浓时终转凉 on 2021-02-18 10:49:06

Question


Let's say you have a search form with multiple select fields. When a user picks an option from a dropdown, but before they submit the form, I need to display the count of matching rows in the database.

Now suppose the site has at least 300k (300,000) visitors a day, and each user changes the form options at least 40 times per visit. That works out to 12M AJAX requests plus 12M count queries against the database, which seems like a bit too much.

The question is how to implement a fast count (using PHP (Zend Framework) and MySQL) so that those additional 12M queries don't affect the load on the site.

One solution would be a table that stores every combination of select fields together with its count (updated whenever a product is added to or deleted from the products table). But this doesn't seem like such a good idea: with just 8 filters (select options) out of 43, there would already be over 8M combination rows to insert and maintain.

Any other thoughts on how to achieve this?

P.S. I don't need code examples, just an idea that would work in this scenario.


Answer 1:


I would suggest a separate table that caches the counts, combined with triggers.

To make it fast, use a MEMORY table and keep it current with triggers on inserts, deletes and updates.

Pseudocode:

CREATE TABLE counts (
  id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `option` INT NOT NULL,
  user_id  INT NOT NULL,
  rowcount INT UNSIGNED NOT NULL DEFAULT 0,
  UNIQUE KEY user_option (user_id, `option`) USING HASH
) ENGINE = MEMORY;

DELIMITER $$

CREATE TRIGGER au_tablex_each AFTER UPDATE ON tablex FOR EACH ROW
BEGIN
  -- Only adjust counts when the row moves to a different (user_id, option) bucket.
  IF (OLD.`option` <> NEW.`option`) OR (OLD.user_id <> NEW.user_id) THEN
    -- Decrement the count for the old combination...
    UPDATE counts
       SET rowcount = rowcount - 1
     WHERE user_id = OLD.user_id AND `option` = OLD.`option`;
    -- ...and increment (or create) the count for the new combination.
    INSERT INTO counts (rowcount, user_id, `option`)
      VALUES (1, NEW.user_id, NEW.`option`)
      ON DUPLICATE KEY UPDATE rowcount = rowcount + 1;
  END IF;
END $$

DELIMITER ;

-- Analogous AFTER INSERT and AFTER DELETE triggers keep the counts
-- current when rows are added to or removed from tablex.

Selecting the counts will be instant, and the updates in the trigger shouldn't take long either, because you're working on a MEMORY table with hash indexes, which have O(1) lookup time.
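On the PHP side the lookup then becomes a single indexed read. A minimal sketch, assuming an existing PDO connection $pdo to the same database:

<?php
// Fetch the cached count for one (user_id, option) combination.
$stmt = $pdo->prepare('SELECT rowcount FROM counts WHERE user_id = ? AND `option` = ?');
$stmt->execute([$userId, $optionId]);
$count = (int) $stmt->fetchColumn(); // fetchColumn() yields false when no row exists yet, which casts to 0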

Links:
Memory engine: http://dev.mysql.com/doc/refman/5.5/en/memory-storage-engine.html
Triggers: http://dev.mysql.com/doc/refman/5.5/en/triggers.html




Answer 2:


I would probably use a pre-calculated table, as you suggest yourself. What's important is that you have a smart mechanism for two things:

  1. Easily query which entries are affected by which change.
  2. Have a unique lookup field for an entire form request.

The 8M entries wouldn't be very significant if you have solid keys, as you would only require a direct lookup (sketched below).
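As a rough illustration of point 2 (my sketch, not part of the original answer): the unique lookup field could be a hash of the normalized filter selection, turning each form request into a single primary-key read. The filter_counts table and its columns are hypothetical:

<?php
// Build one deterministic key per filter combination.
$filters = ['color' => 'red', 'size' => 'L']; // as selected in the form
ksort($filters);                              // make the key order-independent
$filterKey = md5(serialize($filters));

// Direct lookup in the pre-calculated table, assumed here as
// filter_counts(filter_key CHAR(32) PRIMARY KEY, rowcount INT UNSIGNED).
$stmt = $pdo->prepare('SELECT rowcount FROM filter_counts WHERE filter_key = ?');
$stmt->execute([$filterKey]);
$count = (int) $stmt->fetchColumn();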

I would go through the trouble of writing specific updates for this table everywhere they are necessary. Even with a high volume of changes this stays efficient: done correctly, you know exactly which rows to update or invalidate when inserting, updating or deleting a product.

Sidenote, based on your comment: if you need to add code in eight places to cover all the spots where a product can be deleted, it might be a good time to refactor and centralize some of that code.




Answer 3:


There are a few scenarios to consider:

  1. MySQL has a query cache, so you don't have to bother with your own caching if the table isn't updated too frequently.

  2. 99% of users won't care exactly how many results matched; they just need the top few records.

  3. Use EXPLAIN: it reports how many rows the query is expected to match. It's not 100% precise, but it should be good enough to act as a rough row count (see the sketch after this list).
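A minimal sketch of point 3, assuming a PDO connection and an illustrative products query; the rows column of the EXPLAIN output holds the optimizer's estimate:

<?php
// Use the optimizer's row estimate instead of an exact COUNT(*).
$stmt = $pdo->query('EXPLAIN SELECT * FROM products WHERE color_id = 3 AND size_id = 2');
$plan = $stmt->fetch(PDO::FETCH_ASSOC);
$approxCount = (int) $plan['rows']; // rough estimate, not an exact count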




Answer 4:


Not really what you asked for, but since you have a lot of options and want to count the items available based on those options, you should take a look at Lucene and its faceted search. It was made to solve problems like this.

If you don't need the search data to be fully up to date, you can use a queue system to push updates and inserts to Lucene every now and then (so you don't have to bother Lucene with a couple of thousand updates and inserts every day).
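One possible shape for that queue, sketched under assumptions: product changes are appended to a plain index_queue table, and a cron job drains it in batches. pushToLucene() is a hypothetical stand-in for whatever indexing client is actually used:

<?php
// Drain up to 1000 queued product changes into the search index.
$pending = $pdo->query('SELECT id, product_id, action FROM index_queue ORDER BY id LIMIT 1000')
               ->fetchAll(PDO::FETCH_ASSOC);
foreach ($pending as $job) {
    pushToLucene($job['product_id'], $job['action']); // hypothetical indexing call
}
if ($pending) {
    $ids = implode(',', array_map('intval', array_column($pending, 'id')));
    $pdo->exec("DELETE FROM index_queue WHERE id IN ($ids)"); // remove only the processed jobs
}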




Answer 5:


You really only have three options, and no amount of searching is likely to reveal a fourth:

  1. Count the results manually. O(n) in the total number of results at query time.
  2. Store and maintain counts for every combination of filters. O(1) to retrieve the count, but O(2^n) storage and O(2^n) time to update all the counts when records change.
  3. Cache counts, only calculating them (per #1) when they're not found in the cache. O(1) when the data is in the cache, O(n) otherwise (see the sketch after this list).
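A minimal read-through sketch of option 3, assuming memcached on localhost, an illustrative two-filter query, and an arbitrary key scheme and TTL:

<?php
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key = 'count:' . md5(serialize($filters)); // one cache key per filter combination
$count = $mc->get($key);
if ($count === false) {
    // Cache miss: fall back to option 1, the O(n) count at query time.
    $stmt = $pdo->prepare('SELECT COUNT(*) FROM products WHERE color_id = ? AND size_id = ?');
    $stmt->execute([$filters['color_id'], $filters['size_id']]);
    $count = (int) $stmt->fetchColumn();
    $mc->set($key, $count, 300); // keep it for five minutes
}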

It's for this reason that systems that have to scale beyond the trivial - that is, most of them - either cap the number of results they'll count (e.g., items in your Gmail inbox or unread items in Google Reader), estimate the count based on statistics (e.g., Google search result counts), or both.

I suppose it's possible you might actually require an exact count for your users, with no limitation, but it's hard to envisage a scenario where that might actually be necessary.




Answer 6:


A few things you can easily optimise:

  1. Cache everything you can allow yourself to cache. The options for your dropdowns, for example: do they really need to be fetched by AJAX calls? This page answered many of my questions when I implemented memcache, and of course memcached.org has great documentation available too.

  2. Serve anything that can be served statically, i.e. options that don't change frequently could be dumped to a flat file as an array by a cron job every hour, for example, and included by the script at runtime (see the sketch after this list).

  3. MySQL's default configuration settings are often sub-optimal for any serious application load and should be tweaked to fit the needs of the task at hand. Maybe look into the MEMORY engine for high-performance read access.
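A sketch of point 2, with an illustrative options table and cache path; the cron script writes the snapshot, and the page includes it:

<?php
// Cron side: dump rarely-changing dropdown options to a flat PHP file.
$options = $pdo->query('SELECT id, label FROM options')->fetchAll(PDO::FETCH_KEY_PAIR);
file_put_contents('/var/cache/app/options.php', '<?php return ' . var_export($options, true) . ';');

// Page side: include the snapshot at runtime, no database hit.
$options = require '/var/cache/app/options.php';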

You can have a look at these three great-but-very-technical posts on materialized views; as a matter of fact, that whole blog is a goldmine of performance tips for MySQL.

Good luck!




Answer 7:


Presumably you're using AJAX to make the call to the back end you're talking about. Use some kind of cached flat file as an intermediary for the data. Set an expiry time of 5 seconds, or whatever is appropriate. Derive the data file's name from the query's key=value string. On each AJAX request, if the data file is older than your cooldown time, refresh it; if not, use the value stored in the data file.
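A sketch of that flat-file intermediary, hashing the query string into a filename for filesystem safety; runCountQuery() is a hypothetical helper that performs the actual MySQL count:

<?php
$cacheFile = '/tmp/counts/' . md5($_SERVER['QUERY_STRING']); // one file per filter combination
$ttl = 5; // expiry time in seconds

if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
    $count = (int) file_get_contents($cacheFile); // still fresh: reuse the stored value
} else {
    $count = runCountQuery($_GET);                // hypothetical helper that hits MySQL
    file_put_contents($cacheFile, (string) $count);
}
echo $count;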

Also, you might be underestimating the strength of MySQL's query cache mechanism. If you're using the query cache, I doubt there would be any significant performance dip compared to the approach I just described: when a query is served from MySQL's query cache, virtually the only slowdown is the network layer between your application and MySQL.




Answer 8:


Consider what role replication can play in your architecture. If you need to scale out, you might consider replicating your tables from InnoDB to MyISAM. The MyISAM engine automatically maintains a table row count, so plain COUNT(*) queries are free. If you are doing COUNT(col) ... WHERE queries, then you need to rely heavily on well-designed indexes. In that case your count queries might take shape like so:

ALTER TABLE A ADD INDEX ixA (a, b);
SELECT COUNT(a) FROM A USE INDEX (ixA) WHERE a = 1 AND b = 2;



Answer 9:


I feel crazy for suggesting this as it seems that no-one else has, but have you considered client-side caching? JavaScript isn't terrible at dealing with large lists, especially if they're relatively simple lists.

I know your ideal is to have the numbers completely accurate, but heuristics are your friend here, especially since synchronization will never be 100%: a slow connection or high latency due to server-side traffic will leave the AJAX result out of date whenever the data isn't constant. If the data can be edited by other users, perfect synchronization over AJAX is impossible; if it cannot be edited by anyone else, then client-side caching will work and is likely your best option. Oh, and if you're using some sort of socket connection, whatever pushes to the server can simply update all the other clients until a sync can be accomplished.

If you're willing to do that form of caching, you can also cache the results on the server too and simply refresh the query periodically.




Answer 10:


As others have suggested, you really need some sort of caching mechanism on the server side. Whether it's a MySQL table or memcache, either would work. But to reduce the number of calls to the server, retrieve the full list of cached counts in one request and cache it locally in JavaScript. That's a pretty simple way to eliminate almost 12M server hits (see the sketch below).
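The endpoint for that single request could be as simple as the following sketch; the filter_counts table is hypothetical, and the whole map is returned as one JSON object for the client to cache:

<?php
// Return every cached (filter_key => rowcount) pair in one response.
header('Content-Type: application/json');
$counts = $pdo->query('SELECT filter_key, rowcount FROM filter_counts')
              ->fetchAll(PDO::FETCH_KEY_PAIR);
echo json_encode($counts);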

You could probably even store the count information in a cookie that expires in an hour, so subsequent page loads don't need to query again. That is, if you don't need real-time numbers.

Many of the latest browsers also support local storage, which, unlike cookies, doesn't get sent to the server with every request.

You can fit a lot of data into a 1-2 KB JSON structure, so even if you have thousands of possible count options, that is still smaller than your typical image. Just keep maximum cookie sizes in mind if you use cookie caching.



Source: https://stackoverflow.com/questions/6302985/mysql-count-rows-using-filters-on-high-traffic-database
