Best way to scale data, decrease loading time, make my webhost happy

Question

For a Facebook Application, I have to store a list of friends of a user in my MySQL database. This list is requested from my db, compared with other data, etc.

Currently, I store this list of friends within my user table, the uids of the friends are put together in one 'text' field, with a '|' as separator. For example:

ID - UID - NAME - FRIENDS => 1 - 123456789 - John Doe - 987654321|123456|765432

My PHP file requests this row and extracts the list of friends by exploding that field on '|'. This all works fine; every 1000 users take up about 5MB of disk space.

Now the problem:

For an extra feature, I also need to save the names of the friends of the user. I can do this in different ways:

1) Save this data in an extra table. For example:

ID - UID - NAME => 1 - 1234321 - Jane Doe

If I need the name of the friend with ID 1234321, I can request the name from this table. However, the problem is that this table will keep growing until all users on Facebook are indexed (>500 million rows). My webhost is not going to like this! Such a table will take about 25GB of disk space.

2) Another solution is to extend the data saved in the user table, by adding the name to the UID in the friends field (with an extra separator, let's use ','). For example:

ID - UID - NAME - FRIENDS => 1 - 123456789 - John Doe - 987654321,Mike Jones|123456,Tom Bright|765432,Rick Smith

For this solution I have to alter the script to add a second explode (on ','), etc. I'm not sure how much extra disk space this is going to take... But the data doesn't get any easier to handle this way!
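As a rough illustration of the extra parsing that solution 2 needs, here is a minimal sketch. The function name `parseFriendsField` is hypothetical and not part of the original script. It assumes that names never contain '|' and that every entry has the form `uid,name`.

```php
<?php
// Hypothetical helper: turns "uid,name|uid,name" into an array of uid => name.
function parseFriendsField(string $field): array
{
    $friends = [];
    if ($field === '') {
        return $friends; // user without stored friends
    }
    foreach (explode('|', $field) as $entry) {
        // Limit of 2 keeps any comma inside a name as part of the name;
        // array_pad avoids a warning if an entry has no name at all.
        [$uid, $name] = array_pad(explode(',', $entry, 2), 2, '');
        $friends[$uid] = $name;
    }
    return $friends;
}

$friends = parseFriendsField('987654321,Mike Jones|123456,Tom Bright|765432,Rick Smith');
echo $friends[123456], "\n"; // Tom Bright
```

Note that a name containing '|' would break this format, which is one reason a delimiter-based field is fragile.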

3) A third solution gives a good overview of all the data, but will cause the database to be huge. In this solution we create a table of friends, with a row for every friendship. For example:

ID - UID - FRIENDUID => 1 - 123456789 - 54321

ID - UID - FRIENDUID => 3 - 123456789 - 65432

ID - UID - FRIENDUID => 2 - 987654321 - 54321

ID - UID - FRIENDUID => 4 - 987654321 - 65432

As you can see in this example, it gives a very good overview of all the friendships. However, with about 500 million users and, say, an average of 300 friendships per user, this creates a table with 150 billion rows. My host is definitely not going to like that, and I think a table like this will take a lot of disk space.
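The row count above can be checked with a quick calculation. The storage figure in this sketch is only an assumption: it supposes both UID columns are stored as 8-byte integers and ignores the ID column, row overhead and indexes, so the real table would be larger. It also assumes 64-bit PHP.

```php
<?php
// Figures taken from the question.
$users = 500000000;
$friendsPerUser = 300;

// One row per (user, friend) pair.
$rows = $users * $friendsPerUser;

// Assumption: two 8-byte integer columns per row, nothing else counted.
$bytesPerRow = 2 * 8;
$rawBytes = $rows * $bytesPerRow;

echo $rows, " rows\n";                          // 150000000000 rows
echo intdiv($rawBytes, 1000 ** 3), " GB raw\n"; // 2400 GB raw
```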

So... How to solve this problem? What do you think, what is the best way to store the UIDs + names of friends of a user on Facebook? How to scale this kind of data? Or do you have another (better) solution than the three possibilities mentioned above?

Hope you can help me!

Answer 1:

I agree with Amber, solution 1 is going to be the most efficient way to store this data. If you want to stick with your current approach (similar to solution 2), you may want to consider storing the friendship data as a JSON string. It won't produce the shortest possible string, but it will be very easy to parse.

To save the data:

$friends = array(
    'uid1' => 'John Smith',
    'uid2' => 'Jane Doe'
);

$str = json_encode($friends);

// save $str to the database in the "friends" column

To get the data back:

// get $str from the database

// passing true returns an associative array instead of an object
$friends = json_decode($str, true);

var_dump($friends);

Answer 2:

"If I need the name of the friend with ID 1234321, I can request the name from this table. However, the problem is that this table will keep growing, until all users on Facebook are indexed (>500million rows). My webhost is not going to like this! Such a table will take about 25GB of diskspace."

If storing the names of the users you need really takes 25GB, then it takes 25GB. You can't move data around and expect it to get smaller - and the overhead of a table is not that much. Instead, you need to focus on only storing the data you actually need. It is unlikely that everyone on Facebook uses your application (if it were the case, you shouldn't be using a host where 25GB of space is a worry).

So instead of indexing the entirety of Facebook (which would be difficult regardless), just store the data relevant for the people who actually use your application and their immediate friends, which is a much smaller dataset.

Your first proposed solution is the proper way to do it; it eliminates any potential redundancy in name storage.
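To illustrate why a separate names table removes redundancy, here is a minimal sketch that uses plain PHP arrays as stand-ins for the two MySQL tables. The helper `storeFriends` and the UID/name pairings are made up for the example.

```php
<?php
// Stand-ins for two tables; in MySQL these would be real tables.
$names = [];       // uid => name, one entry per distinct person
$friendLists = []; // app user uid => list of friend uids

// Hypothetical helper: records the friends of one application user.
function storeFriends(array &$names, array &$friendLists, int $userUid, array $friends): void
{
    foreach ($friends as $uid => $name) {
        $names[$uid] = $name; // a person seen again overwrites, never duplicates
    }
    $friendLists[$userUid] = array_keys($friends);
}

// Two application users who happen to share the same two friends (sample data).
storeFriends($names, $friendLists, 123456789, [54321 => 'Jane Doe', 65432 => 'Tom Bright']);
storeFriends($names, $friendLists, 987654321, [54321 => 'Jane Doe', 65432 => 'Tom Bright']);

echo count($names), "\n"; // 2
```

Four friend references end up as two stored names, and the names table only ever contains people who are friends of actual application users.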

Answer 3:

I really think you should go with the third option; it is the one that scales.
With the first method you store a LOT of redundant data: if 1 is friends with 2, then 2 is also friends with 1, yet you store both relations.
This also makes the 150 billion row count unrealistic. It is more likely to be at most half of that, because a row in the relations table works in both directions!
So the first user will generate 300 rows in the table, but the second user (if he is friends with 1) will generate just 299. Continue like this and the last user won't generate any relation rows at all, because they are all already present!
Also, when you start searching for certain relations, the third option will be much faster, since you'll have an int index instead of a fulltext index, which probably saves another 50% in both storage and processing speed.
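One way to store each friendship only once is to always put the lower UID first, so (1, 2) and (2, 1) map to the same row. The sketch below uses a PHP array as a stand-in for the relations table; the function names are hypothetical, and in MySQL the same effect would need a unique index over both columns.

```php
<?php
// Stand-in for the relations table: "low:high" => [lowUid, highUid].
$friendships = [];

// Hypothetical helper: stores a friendship once, whichever side reports it.
function addFriendship(array &$friendships, int $a, int $b): void
{
    $low = min($a, $b);
    $high = max($a, $b);
    $friendships[$low . ':' . $high] = [$low, $high];
}

// Hypothetical helper: a lookup has to check both columns.
function friendsOf(array $friendships, int $uid): array
{
    $result = [];
    foreach ($friendships as [$low, $high]) {
        if ($low === $uid) {
            $result[] = $high;
        } elseif ($high === $uid) {
            $result[] = $low;
        }
    }
    return $result;
}

addFriendship($friendships, 1, 2);
addFriendship($friendships, 2, 1); // same pair, no new row
addFriendship($friendships, 1, 3);

echo count($friendships), "\n"; // 2
```

The trade-off is that every lookup has to search both columns, so in a real table both would need to be indexed.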

If your application ever reaches 500 million users, you will simply have to get a better hosting service.

Source: https://stackoverflow.com/questions/5096127/best-way-to-scale-data-decrease-loading-time-make-my-webhost-happy

Tags

php

mysql

database

facebook

scaling