Build an undirected weighted graph by matching N vertices

问题

Problem:
I want to suggest the top 10 most compatible matches for a particular user, by comparing his/her 'interests' with interests of all others. I'm building an undirected weighted graph between users, where the weight = match score between the two users.

I already have a set of N users: S. For any user U in S, I have a set of interests I. After a long time (a week?) I create a new user U with a set of interests and add it to S. To generate a graph for this new user, I'm comparing interest set I of the new user with the interest sets of all the users in S, iteratively. The problem is with this "all the users" part.

Let's talk about the function for comparing interests. An interest in a set of interests I is a string. I'm comparing two strings/interests using WikipediaMiner (it uses Wikipedia links to infer how closely related two strings are. eg. Billy Jean & Thriller ==> high match, Brad Pitt & Jamaica ==> low match blah blah). I've asked a question about this too (to see if there's a better solution than the one I'm currently using.

So, the above function takes non-negligible time, and in total, it'll take a HUGE time when we compare thousands (maybe millions?) of users and their hundreds of interests. For 100,000 users, I can't afford to make 100,000 user comparisons in a small time (<30sec) in this way. But, I have to give the top 10 recommendations within 30 secs, possibly a preliminary recommendation, and then improve on it in the next 1 min or so, calculate improved recommendations. Simply comparing 1 user vs the N users sequentially is too slow.

Question:
Please suggest an algorithm, method or tool using which I can improve my situation or solve my problem.

回答1:

I could think of only an approach to solve the problem, since the outcomes of below stuff depend on the nature of inter-relation between interests.

=>step:1 As your title says.Build an undirected weighted graph with interests as vertices and the weighted match between them as edges.

=>step:2 - cluster the interests. (Most complex)

Kmeans is a commonly used clustering algo, but works on based on K-Dimensional vector space.refer wiki to see how K-means works. it minimizes the sum of (sum of distance^2 for each point and say the center of the cluster) for all clusters. In your case, there are no dimensions available. so try if you can apply the minimizing logic applied there by creating some kind of rule, for distance between two vertices, higher match => lesser distance and vice versa (what are the different matching levels provided by wiki-miner?). chose the Mean of cluster as say the most connected vertex in the chosen set, page ranking sounds to be a good option for "figuring the most connected vertex ".

"Pair-counting F-Measure" sounds like it suit's your need (weighted graph), check for other options available.

(Note: keep modifying this step untill a right clustering algo is found and the right calibration for distance rule, no of clusters etc are found. )

=>Step:3 - Evaluate the clusters

from here on its like calibrating a couple things to fit your need. Examine the clusters, reevaluate : the number of clusters , inter-cluster distance, distance between vertices inside clusters, size of clusters, time\precision trade-off (compare final - match results without any clustring) goto: step-2 untill this evaluation is satisfactory.

=>step:4 - Examinie new inerest

iterate thru all clusters, calculate conectivity in each cluster, sort clusters based on high connectivity, for the top x% of sorted clusters sort and filter out the highly connected interests.

=>step:5 - Match User

reverse look up set of all users using the interests obtained out of step-4, compare all interests for both users, generate a score.

=>step:6 - Apart form the above you can distribute the load (multiple machines can be used for clusters machine-n clusters) to multiple systems\processors, based on the traffic and stuff.

what is the application for this problem, whats the expected traffic?

Another solution to find the connectivity between the new interest and "set of interests in Cluster" C. Wiki-Miner runs on a set of wiki documents, let me call it the UNIVERSE.

1:for each cluster fetch and maintain(index, lucene might be handy) the "set of high relevent docs"(I am calling it HRDC) out of the UNIVERSE. so you have 'N' HRDC's if you got 'N' clusters.

2:when a new interest comes find "Conectivity with Cluster" = "Hit ratio of interest in HRDC/Hit ratio of interest in UNIVERSE" for each HRDC.

3:Sort "Conectivity with Cluster"'s and choose the Highly connected clusters.

4:Either compare all the vertices in the cluster with the new interest or the highly connected vertices (using Page Ranking), depending on the time\Precision trade off , that suits you.

回答2:

One flaw is that your basing your algorithms complexity on the wrong thing. The real issue is that you have to compare each unique interest against every other unique interest (and that interest against itself).

If all of the interests are unique, then there is probably nothing you can do. However, if you have a lot of duplicate interests you can perhaps speed up the algorithm this way by the following.

Create a graph that associates each interest with the users that have that interest. In such a way that allows for fast look-ups.
Create a graph that shows how each interest relates to each other interest, also in such a way that allows for fast look-ups.

Therefore, when a new user is added, their interests are compared to all other interest and stored in a graph. You can then use that information to build to build a list of users with similar interests. That list of users will then need to be filtered somehow to bring it down to the top 10.

Finally, add that user and their interests to the graph of users and interests. This is done last so that the user with the most closely matched interests isn't the user themselves.

Note: There might be some statistical short cuts that you could do something like this: A is related to B, B is related to C, C is related to D, therefore A is related to B, C, and D. However, to use those kinds of short cuts likely requires a much better understanding of how your comparison function works, which is a bit beyond my expertise.

Approximate solution:

I forgot to mention it earlier, but what your looking when comparing users or interests is a "Nearest neighbor search" in higher dimensions. Meaning, that for exact solutions, a linear search generally works better than data structures. So approximation is probably the best way to go if you need it faster.

To obtain a quick approximate solution (without guarantees as to how close it is), you'll need a data structure that allows for quickly being able to determine which users are likely to be similar to a new user.

One way to build that structure:

Pick 300 random users. These will be the seed users for 300 clusters. Ideally, you'd use the 300 users that are least closely related, but that's probably not practical, still might be wise to ensure that the no seed user is too closely related to the other users (as a sum or average of it's comparison's to other users).
The clusters are then filled by each user joining the cluster whose representative user most closely matches it.
The top ton can then be determined by picking the top 10 users most closely related users from that cluster.

If you ensure that the number of clusters and the users per cluster is always fairly close to sqrt(number of users), then you obtain a fair approximation in O(sqrt(N)) by only checking the points within the cluster. You can improve that approximation by including users in additional clusters and checking the representative users for each cluster. The more clusters you check, the closer you get towards O(N) and an exact solution. Although, there's probably no way to say how close the current solution is to the exact solution. Chances are you start to hit dimishing returns after checking more than a total of log(sqrt(N)) clusters total. Which would put you at O(sqrt(N) log(sqrt(N))).

回答3:

few thoughts ...

Not exactly a graph theory solution.

assuming a finite set of interests. for each user maintain a bit sequence where each interest is a bit representing whether the user has that interest or not. For a new user simply multiply the bit sequence with the existing users bit sequence and find the number of bits in the result which gives an idea of how closely their interests match.

来源：https://stackoverflow.com/questions/14218801/build-an-undirected-weighted-graph-by-matching-n-vertices

标签

graph

matching

graph-algorithm

bigdata