Advanced MySQL: Find correlations between poll responses

倾然丶 夕夏残阳落幕 提交于 2019-12-03 21:05:35

Don't have an instance handy to test, can you see if this gets proper results:

select
        poll_id,
        option_id,
        ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
        n
from
(
    select 
        poll_id,
        option_id,
        SUM(score) AS sum1,
        SUM(score_rev) AS sum2,
        SUM(score * score) AS sum1sq,
        SUM(score_rev * score_rev) AS sum2sq,
        SUM(score * score_rev) AS psum,
        COUNT(*) AS n
    from
    (
            select 
                responses.poll_id, 
                responses.option_id,
                CASE 
                    WHEN user_resp.user_id IS NULL THEN SELECT 0
                    ELSE SELECT 1
                END CASE as score,
                CASE 
                    WHEN user_resp.user_id IS NULL THEN SELECT 1
                    ELSE SELECT 0
                END CASE as score_rev,
            from responses left outer join 
                    (
                        select 
                            user_id
                        from 
                            responses 
                        where
                            poll_id = 1 and 
                            option_id = 2
                    )user_resp  
                        ON (user_resp.user_id = responses.user_id)
    ) temp1 
    group by
        poll_id,
        option_id
)components 

After a few hours of trial and error, I managed to put together a query that works correctly:

SELECT poll_id AS p_id, 
       option_id AS o_id, 
       COUNT(*) AS optCount, 

       (SELECT COUNT(*) FROM response WHERE option_id = o_id AND user_id IN 
          (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2')) /
       (SELECT COUNT(*) FROM response WHERE poll_id = p_id  AND user_id IN 
          (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2')) 
       AS percentage 

FROM response 
INNER JOIN 
   (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2') AS user_ids
ON response.user_id = user_ids.user_id
WHERE poll_id != '1' 

GROUP BY option_id DESC 
ORDER BY percentage DESC, optCount DESC

Based on a tests with a small data set, this query looks to be reasonably fast, but I'd like to modify it so the "IN" subquery is not repeated three times. Any suggestions?

This seems to give the right results for me:

select poll_stats.poll_id,
       option_stats.option_id,
       (100 * option_responses / poll_responses) as percent_correlated
from (select response.poll_id,
             count(*) as poll_responses
      from response selecting_response
           join response on response.user_id = selecting_response.user_id
      where selecting_response.poll_id = 1 and selecting_response.option_id = 2
      group by response.poll_id) poll_stats
      join (select options.poll_id,
                   options.id as option_id,
                   count(response.id) as option_responses
            from options
                 left join response on response.poll_id = options.poll_id
                           and response.option_id = options.id
                           and exists (
                            select 1 from response selecting_response
                            where selecting_response.user_id = response.user_id
                                  and selecting_response.poll_id = 1
                                  and selecting_response.option_id = 2)
            group by options.poll_id, options.id
           ) as option_stats
       on option_stats.poll_id = poll_stats.poll_id
where poll_stats.poll_id <> 1
order by 3 desc, option_responses desc
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!