问题
I've got four MySQL tables:
users (id, name)
polls (id, text)
options (id, poll_id, text)
responses (id, poll_id, option_id, user_id)
Given a particular poll and a particular option, I'd like to generate a table that shows which options from other polls are most strongly correlated.
Suppose this is our data set:
TABLE users:
+------+-------+
| id | name |
+------+-------+
| 1 | Abe |
| 2 | Bob |
| 3 | Che |
| 4 | Den |
+------+-------+
TABLE polls:
+------+-----------------------+
| id | text |
+------+-----------------------+
| 1 | Do you like apples? |
| 2 | What is your gender? |
| 3 | What is your height? |
| 4 | Do you like polls? |
+------+-----------------------+
TABLE options:
+------+----------+---------+
| id | poll_id | text |
+------+----------+---------+
| 1 | 1 | Yes |
| 2 | 1 | No |
| 3 | 2 | Male |
| 4 | 2 | Female |
| 5 | 3 | Short |
| 6 | 3 | Tall |
| 7 | 4 | Yes |
| 8 | 4 | No |
+------+----------+---------+
TABLE responses:
+------+----------+------------+----------+
| id | poll_id | option_id | user_id |
+------+----------+------------+----------+
| 1 | 1 | 1 | 1 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 2 | 3 |
| 4 | 1 | 2 | 4 |
| 5 | 2 | 3 | 1 |
| 6 | 2 | 3 | 2 |
| 7 | 2 | 3 | 3 |
| 8 | 2 | 4 | 4 |
| 9 | 3 | 5 | 1 |
| 10 | 3 | 6 | 2 |
| 10 | 3 | 5 | 3 |
| 10 | 3 | 6 | 4 |
| 10 | 4 | 7 | 1 |
| 10 | 4 | 7 | 2 |
| 10 | 4 | 7 | 3 |
| 10 | 4 | 7 | 4 |
+------+----------+------------+----------+
Given the poll ID 1 and the option ID 2, the generated table should be something like this:
+----------+------------+-----------------------+
| poll_id | option_id | percent_correlated |
+----------+------------+-----------------------+
| 4 | 7 | 100 |
| 2 | 3 | 66.66 |
| 3 | 6 | 66.66 |
| 2 | 4 | 33.33 |
| 3 | 5 | 33.33 |
| 4 | 8 | 0 |
+----------+------------+-----------------------+
So basically, we're identifying all of the users who responded to poll ID 1 and selected option ID 2, and we're looking through all the other polls to see what percentage of them also selected each other option.
回答1:
Don't have an instance handy to test, can you see if this gets proper results:
select
poll_id,
option_id,
((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
n
from
(
select
poll_id,
option_id,
SUM(score) AS sum1,
SUM(score_rev) AS sum2,
SUM(score * score) AS sum1sq,
SUM(score_rev * score_rev) AS sum2sq,
SUM(score * score_rev) AS psum,
COUNT(*) AS n
from
(
select
responses.poll_id,
responses.option_id,
CASE
WHEN user_resp.user_id IS NULL THEN SELECT 0
ELSE SELECT 1
END CASE as score,
CASE
WHEN user_resp.user_id IS NULL THEN SELECT 1
ELSE SELECT 0
END CASE as score_rev,
from responses left outer join
(
select
user_id
from
responses
where
poll_id = 1 and
option_id = 2
)user_resp
ON (user_resp.user_id = responses.user_id)
) temp1
group by
poll_id,
option_id
)components
回答2:
After a few hours of trial and error, I managed to put together a query that works correctly:
SELECT poll_id AS p_id,
option_id AS o_id,
COUNT(*) AS optCount,
(SELECT COUNT(*) FROM response WHERE option_id = o_id AND user_id IN
(SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2')) /
(SELECT COUNT(*) FROM response WHERE poll_id = p_id AND user_id IN
(SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2'))
AS percentage
FROM response
INNER JOIN
(SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2') AS user_ids
ON response.user_id = user_ids.user_id
WHERE poll_id != '1'
GROUP BY option_id DESC
ORDER BY percentage DESC, optCount DESC
Based on a tests with a small data set, this query looks to be reasonably fast, but I'd like to modify it so the "IN" subquery is not repeated three times. Any suggestions?
回答3:
This seems to give the right results for me:
select poll_stats.poll_id,
option_stats.option_id,
(100 * option_responses / poll_responses) as percent_correlated
from (select response.poll_id,
count(*) as poll_responses
from response selecting_response
join response on response.user_id = selecting_response.user_id
where selecting_response.poll_id = 1 and selecting_response.option_id = 2
group by response.poll_id) poll_stats
join (select options.poll_id,
options.id as option_id,
count(response.id) as option_responses
from options
left join response on response.poll_id = options.poll_id
and response.option_id = options.id
and exists (
select 1 from response selecting_response
where selecting_response.user_id = response.user_id
and selecting_response.poll_id = 1
and selecting_response.option_id = 2)
group by options.poll_id, options.id
) as option_stats
on option_stats.poll_id = poll_stats.poll_id
where poll_stats.poll_id <> 1
order by 3 desc, option_responses desc
来源:https://stackoverflow.com/questions/5312166/advanced-mysql-find-correlations-between-poll-responses