quickest sparse matrix access, when disk is involved

Submitted by 强颜欢笑 on 2019-12-13 03:59:06

Question


Imagine you have a table 'users' with 10 million records and a table 'groups' with 1 million records. On average there are 50 users per group, which in an RDBMS I would store in a table called users2groups. users2groups is in fact a sparse matrix. Only 80% of the full dataset of users and groups fits into available memory. The data for the group memberships (users2groups) comes on top, so if memory is needed to cache group memberships, it has to be taken away from the users table, the groups table, or both.
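
For concreteness, here is a minimal sketch of the three-table layout described above; the table and column names are illustrative assumptions, and sqlite3 is used only for brevity:

    import sqlite3

    # Minimal illustrative schema for the setup described above; names and
    # types are assumptions, not taken from the question.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE "groups" (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- users2groups is the sparse matrix: ~50 rows per group on average.
    CREATE TABLE users2groups (
        user_id  INTEGER NOT NULL REFERENCES users(id),
        group_id INTEGER NOT NULL REFERENCES "groups"(id),
        PRIMARY KEY (user_id, group_id)
    );
    -- second index so both directions (user->groups, group->users) are fast
    CREATE INDEX idx_u2g_group ON users2groups (group_id, user_id);
    """)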

I would like to be able to:

  • find users quickly by name AND
  • find groups quickly by name AND
  • get all users of a group quickly AND
  • get all groups a user belongs to quickly

From experience I know that disk latency determines your access speed to a large extent. You can also trade read speed against write speed. However, before one can do that, one has to decide on a database type these days... such as:

  • Relational DBMS
  • Key-value stores
  • Document stores
  • RDF stores
  • Object oriented DBMS
  • Graph DBMS
  • Search engines
  • Multivalue DBMS
  • Native XML DBMS
  • Wide column stores
  • Content stores
  • Navigational DBMS
  • (compressed) Bitmaps
  • files
  • more...?

So the question is: which of all these systems, or which combination, gives the best overall read access performance (with acceptable write access performance) when RAM capacity is lower than the available data, considering sparse matrices? ...and how does the memory utilization across all three entities/tables have to be balanced in the chosen technology...?

Since this is a conceptual question, disk space and CPU capacity are out of scope, or considered to be available "indefinitely".

By the way, I am aware that searching for names such as user names or group names can be sped up efficiently with indexes based on hashes (e.g. crc32(lower(STRING))); an example select would then be: select somethinguseful from users where name=SEARCHSTRING and hash=crc32(lower(SEARCHSTRING)). However, the hashes and their indexes are not yet included in the memory figure when I say the users and groups tables have 80% RAM coverage. That is because I am unclear whether there is a better integrated solution. At the moment I just assume that breaking the whole topic up into three pieces, users, groups and users2groups, is most sensible. I am lacking proof here.
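
A minimal sketch of that hash-assisted lookup (the hash column, index name and sqlite3 usage are assumptions carried over from the inline SQL above):

    import sqlite3
    import zlib

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, hash INTEGER NOT NULL);
    CREATE INDEX idx_users_hash ON users (hash);  -- narrow, fixed-width index
    """)

    def add_user(name: str) -> None:
        # store crc32(lower(name)) next to the name so lookups can use the hash index
        conn.execute("INSERT INTO users (name, hash) VALUES (?, ?)",
                     (name, zlib.crc32(name.lower().encode("utf-8"))))

    def find_user(name: str):
        # the hash predicate narrows the search via the index; the name predicate
        # filters out the rare crc32 collisions
        h = zlib.crc32(name.lower().encode("utf-8"))
        return conn.execute("SELECT id, name FROM users WHERE hash = ? AND name = ?",
                            (h, name)).fetchone()

    add_user("Alice")
    print(find_user("Alice"))   # -> (1, 'Alice')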

-------------------- UPDATE -------------------------------------

I understand that there are competing concepts on the table:

  • denormalization to an extent where I can live with a much reduced number of queries against the disk (e.g. mongodb, etc.)
  • squeezing data so that most of it fits into memory anyway (e.g. compressed bitmaps; a small sketch follows this list)
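
To make the "squeeze" idea concrete, here is a minimal sketch of group membership stored as one compact, sorted id-array per group; the in-memory layout is an assumption for illustration, and a compressed bitmap library would be the more realistic next step:

    from array import array
    from bisect import bisect_left

    # One compact, sorted id-array per group: roughly 4 bytes per membership
    # instead of a full relation row plus B-tree index entries. A compressed
    # bitmap would squeeze this further; this is only an illustrative sketch.
    memberships: dict[int, array] = {}   # group_id -> sorted array of user ids

    def add_member(group_id: int, user_id: int) -> None:
        ids = memberships.setdefault(group_id, array("I"))
        pos = bisect_left(ids, user_id)
        if pos == len(ids) or ids[pos] != user_id:   # skip duplicates
            ids.insert(pos, user_id)

    def users_of_group(group_id: int) -> list[int]:
        return list(memberships.get(group_id, array("I")))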

Since denormalization means 'blowing up data volumes', these two concepts seem to contradict each other. Are there best practices, or scientific or at least sensible arguments, for when to use denormalization and when to use the squeeze-data approach? E.g. a KPI saying: if less than 80% fits into memory, go for denormalization, or something similar?

-------------------- UPDATE -------------------------------------

Extra memory costs extra money, while most database servers usually have lots of empty disk space sitting idle, so denormalization is cheap. On the other hand, the opportunities for denormalization are limited: disk latency physically caps the maximum number of queries per second, full stop. Too many queries against disk get queued, which constrains denormalization for applications with a lot of traffic. Even the access speed of denormalized data depends on memory to a large extent.

So maybe KPIs are not possible here. Anyway, for the given sparse matrix example, how do denormalization and the squeeze-data approach need to be balanced? I thought about compressing the users and groups tables, leaving them in an RDBMS, and then assigning the freed memory to the cache of a document DB that serves the users2groups relations. However, this introduces a whole set of new issues, such as multiple round trips to deal with two database systems and more complicated query management. So how to tackle this?

-------------------- UPDATE -------------------------------------

As per lagivan's suggestion, the sparse matrix with purely flagging relations seems to be solved in a sensible way: have two tables, users and groups; in the users table keep a multiple-ID field with the IDs of the related groups, and vice versa in the groups table keep a multiple-ID field with the IDs of the related users. I guess this solution is not tightly coupled to a specific technology. It could even be implemented in MySQL via VARBINARY or any blob column (a small sketch of the packing follows below).
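
A minimal sketch of how such a multiple-ID blob column could be packed and unpacked; the fixed 4-byte little-endian layout is an arbitrary assumption, not something from the answer:

    import struct

    # Pack a list of related ids into a binary blob for a VARBINARY/BLOB column,
    # and unpack it again after reading the row back.
    def pack_ids(ids: list[int]) -> bytes:
        return struct.pack(f"<{len(ids)}I", *ids)      # 4 bytes per id, little-endian

    def unpack_ids(blob: bytes) -> list[int]:
        return list(struct.unpack(f"<{len(blob) // 4}I", blob))

    users_blob = pack_ids([17, 42, 1001])              # e.g. stored on the groups row
    assert unpack_ids(users_blob) == [17, 42, 1001]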

The outstanding part of the question relates to sparse matrices that contain some 'juice information' like a status or a lastupdatedate. Using foreign-key arrays would, by design, sort of disable that information. So the original question for such a scenario is still open: which of all these systems, or which combination, gives the best overall read access performance (with acceptable write access performance) when RAM capacity is lower than the available data, considering sparse matrices? ...and how does the memory utilization across all three entities/tables have to be balanced in the chosen technology...?


Answer 1:


Taking into account that you cannot fit the complete dataset into RAM, you will face performance issues caused by I/O operations anyway. Thus you can rely only on caching and an optimal data structure (DB type). Moreover, I believe you should choose a future-proof solution. I cannot cover all the database types, but I'd vote against an RDBMS, as a sparse matrix users2groups there would be rather inefficient. Here are some options:

  1. Document store (e.g. MongoDB) would be a nice solution if you denormalize the users2groups relationship, especially if you use sharding for users and groups to fit them in memory across several servers. Every user document will contain the list of its groups, and every group document will contain the list of its users. Thus, with caching in place, as soon as a user record is loaded you get the list of its groups for free (see the sketch after this list).
  2. RDF/Graph store may be a good solution too, especially if a SaaS solution is an option for you, since scaling will then be much easier. For example, we're using the Dydra graph database. In this case the data will stay normalized anyway.
  3. Native XML store (e.g. eXist-db) would work too, especially if you prefer schema-based data. Here I'd apply the same denormalization as in the document store option.
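
A minimal sketch of the denormalized document shapes from option 1; the field names are illustrative assumptions, and pymongo is used only to show the shapes in context:

    from pymongo import MongoClient   # assumes a MongoDB instance is reachable locally

    db = MongoClient("mongodb://localhost:27017")["example"]

    # Denormalized shapes: each side embeds the ids of the other side.
    db.users.insert_one({"_id": 42, "name": "alice", "group_ids": [7, 9]})
    db.groups.insert_one({"_id": 7, "name": "admins", "user_ids": [42, 17]})

    # Loading a user already yields its group ids without touching a junction table.
    user = db.users.find_one({"name": "alice"})
    groups = list(db.groups.find({"_id": {"$in": user["group_ids"]}}))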

Finally, I believe the devil is in the details. The resulting performance will mostly depend on the skills and experience of your DBA.

UPDATE

Taking into account the additional requirement of storing more data for the users2groups cross references, I advise two options:

  1. Keep the denormalization of users2groups (the MongoDB case). Considering your use cases, I got the feeling that groups are the central point for managing the users2groups relation: you mostly need statistics about a group, not about a user. Then you should store the relation data (creation and modification dates, status) inside the group document. The user document will only contain the list of linked group ids, regardless of status (see the sketch after this list).
  2. Reject denormalization. Here you can opt for a graph database instead, or use an additional RDBMS table or MongoDB collection users2groups. In the latter case you will have to use table joins or multiple requests (MongoDB). Of course, this decreases performance compared to the first option, but it can be more beneficial if you end up with more relation data or more complicated use cases.
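
A possible document shape for option 1, with the relation metadata embedded on the group side; all field names are assumptions for illustration:

    # The group document carries the per-membership metadata ('juice information');
    # the user document stays slim and only lists group ids.
    group_doc = {
        "_id": 7,
        "name": "admins",
        "members": [
            {"user_id": 42, "status": "active",  "created": "2014-09-01", "modified": "2014-09-09"},
            {"user_id": 17, "status": "pending", "created": "2014-09-05", "modified": "2014-09-05"},
        ],
    }
    user_doc = {"_id": 42, "name": "alice", "group_ids": [7]}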



Answer 2:


This addresses the part of the question related to sparse matrices that contain some 'juice information' like a status or a lastupdatedate.

Let me generalize the first part of the answer: in fact I did not find a real reason to move away from an RDBMS to any other technology in order to handle sparse matrices better, so let's consider only an RDBMS (where denormalized data can be stored in VARBINARY or blob columns). I am a big fan of normalization. However, what I have learned now is: denormalize if denormalization results in lower memory consumption for data AND index data together. The normalization rules aim at optimizing the data's memory consumption without taking into account that there are scenarios, such as sparse matrices with indexed foreign-key pairs, where the costs and benefits of normalization can easily be turned around. I also (re-)learned that squeezing data into memory as well as possible is THE key to performance (lagivan likewise argued for the performance leverage of caching).

Having said that, there are various options for the second part of the answer:

the ones from lagivan's update, plus:

  • Have a dedicated sum/track/audit table for every 'juice information' use case, i.e. denormalize just in time based on aggregated data.
  • Combine denormalization and normalization, whereby the denormalized part covers just the plain relations (see the first part of the answer) and the normalized part with the 'juice info' gets a very limited time scope (e.g. three months).
  • Write the 'juice information' away into a log file and then aggregate reports sporadically (e.g. once per day); this is what data warehouses do (a small sketch follows this list).
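
A minimal sketch of that log-and-aggregate idea; the file name, event format and aggregation key are assumptions made up for illustration:

    import csv
    from collections import Counter
    from datetime import datetime, timezone

    LOG_FILE = "membership_events.csv"   # hypothetical append-only log

    def log_event(user_id: int, group_id: int, status: str) -> None:
        # append one membership event instead of updating the hot tables in place
        with open(LOG_FILE, "a", newline="") as f:
            csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(),
                                    user_id, group_id, status])

    def aggregate_daily() -> Counter:
        # run e.g. once per day: count events per (group, status)
        counts = Counter()
        with open(LOG_FILE, newline="") as f:
            for _ts, _user_id, group_id, status in csv.reader(f):
                counts[(int(group_id), status)] += 1
        return counts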

The solution is then to calculate the memory consumption of data AND indexes for each acceptable option and pick the one with the lowest value. By the way, the implementation effort differs considerably between the options.
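
A back-of-envelope sketch of such a comparison; all byte sizes are rough assumptions for illustration, not measurements:

    # Rough comparison of the users2groups footprint:
    # 10 million users * 50 memberships each = 500 million relation entries.
    MEMBERSHIPS = 10_000_000 * 50

    # Option A: normalized relation table with two composite B-tree indexes.
    row_bytes   = 2 * 8 + 16          # two 8-byte ids plus assumed per-row overhead
    index_bytes = 2 * (2 * 8)         # two indexes, ~16 bytes per entry each
    normalized_gb = MEMBERSHIPS * (row_bytes + index_bytes) / 1e9

    # Option B: denormalized id-array blobs on both the users and the groups side.
    blob_bytes = 2 * 4                # each membership appears as a 4-byte id on both sides
    denormalized_gb = MEMBERSHIPS * blob_bytes / 1e9

    print(f"normalized:   ~{normalized_gb:.0f} GB of data plus indexes")
    print(f"denormalized: ~{denormalized_gb:.0f} GB of id arrays")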



Source: https://stackoverflow.com/questions/25748022/quickest-sparse-matrix-access-when-disk-is-involved
