How do I write a function to compare and rank many sets of boolean (true/false) answers?

问题

I've embarked on a project that is proving considerably more complicated than I'd first imagined. I'm trying to plan a system that is based around boolean (true/false) questions and answers. Users on the system can answer any questions from a large set of boolean (true/false) questions and be presented with a list showing the most similar users (in order of similarity) based on their answers.

I've Googled far and wide but still not come up with much, so I was hoping somebody could point me in the right direction. I'd like to know:

What is the best data structure and method to store this kind of data? I'd originally assumed I could create two tables "questions" and "answers" in an SQL database. However, I'm not wondering if it would be simpler to compare two sets of answers if they were both listed as numerical string. I.e. 0 = not answered, 1 = true, 2 = false. When comparing the strings weights could be added for "not answered" = 0, "same answer" = 1, "opposite answer" = -1 producing a similarity score.

How would I go about comparing two sets of answers? To be able to work out the "similarity" between these sets of answers I'm going to have to write a comparison function. Does anyone know what kind of comparison would best suite this problem? I've looked into sequence alignment and I think this could be the correct way to go but I'm unsure as this requires the data to be in a long sequence, plus the questions aren't related so aren't naturally a sequence.

How do I apply this comparison function to a large set of data? Once I've written the comparison function I could just compare each users answers to every other user's answers, however this doesn't seem very efficient and probably wouldn't scale very well. I've been looking into cluster analysis methods to automatically group users according to similar answers, do you think this could work or does anyone know a better method I could look into?

I'd really appreciate any helpful pointers. Thanks!

回答1:

If you were to set it up in SQL with tables for Users, Questions, and Answers then I believe that the following SQL could be used to get other users with similar responses. Simply add a TOP clause to get the number that you want.

I don't know how performance will be, but that would depend a lot on the size of your data.

SELECT
    U2.userid,
    SUM(CASE
            WHEN A1.answer = A2.answer THEN 1
            WHEN A1.answer <> A2.answer THEN -1
            WHEN A1.answer IS NULL OR A2.answer IS NULL THEN 0  -- A bit redundant, but I like to make it clear
            ELSE 0
        END) AS similarity_score
FROM
    Questions Q
LEFT OUTER JOIN Answers A1 ON
    A1.question_id = Q.question_id AND
    A1.userid = @userid
LEFT OUTER JOIN Answers A2 ON
    A2.question_id = A1.question_id AND
    A2.userid <> A1.userid
LEFT OUTER JOIN Users U2 ON
    U2.userid = A2.userid
GROUP BY
    U2.userid
ORDER BY
    similarity_score DESC

回答2:

Data Storage: I would say a database is a good idea (sounds like the potential for a rather large data set). I don't know how many questions you plan on having but to help with simplifying the analysis (including your SQL queries) a bit you may want to group answers to similar questions in separate tables. And I would agree using a numerical value (byte 0-2) would be a good route to take instead of a boolean or something else. You are computing a similarity score so might as well start with numbers.

Comparison: As far as the comparison itself, i would suggest creating an class SimilarQuestionAnswers that contains a list of bytes and a class UserAnswers that contains a list of these SimilarQuestionAnswers. What this does is it sets up your clusters for the cluster analysis method you mentioned. This way you can add weight to certain clusters. (cluster a is an important cluster so it's score is multiplied by 20 where as cluster b is not as important so its score is only multiplied by 10) This also allows you to apply different comparisons for each cluster if that is needed.

I know you said the questions aren't related but you can still at least group questions by their importance. I think the sequence analysis will still work granted your similarity matrix will be all 1's so that kinda simplifies the problem a bit, but the rest of the math associated with that should be sufficient.

Comparison Applied: This is where having the database back end comes in handy. Use SQL queries to minimize the dataset you are dealing with. If you are comparing one person with everyone else, you can use the SQL sum method on their answers to get a quick and dirty comparison within each cluster and pull only those within a certain threshold. This may result in some overlap but that can be eliminated easily.

Another thought is to also have a table with each user and a column for each cluster with a comparison to a fake user that has answered true to each question. Then you could just query that table for a range around the current users scores for each cluster. This my be faster but less accurate.

Either way in the end you will still need to do the comparison to each of the users you get from that query. So the faster you can make that comparison the better. Try to stick to a formula that involves only +,-,*,/ most of the Math.Whatever() methods can add a lot of time over a large number of calls.

Sorry this was so long, most of the questions were pretty open ended and I had to assume a few details. I hope this helps.

回答3:

I would think you might want a per-question weight that was based on how all users responded. As an extreme case, if 1,000 people answered questions A & B, and the results were A (2Y, 998N) and B (500Y, 500N), the two 'Ys for A count much more than any given pair of Y's from B. And any similar pair from B is somewhat more similar than any pair of Ns from A.

Check out Bayesian Probability

回答4:

Rather than cluster the users, you might also consider clustering the questions (e.g. OkCupid). Then instead of comparing users on all answers, you compare them on the categories.

来源：https://stackoverflow.com/questions/3320306/how-do-i-write-a-function-to-compare-and-rank-many-sets-of-boolean-true-false

标签

sql

language-agnostic

database-design

data-structures

computer-science