Question
I have a table INTERACTIONS:
CustomerID | Channel | Response
-----------+---------+----------
245 | SMS | Accept
245 | PUSH | Ignore
247 | SMS | Accept
249 | PUSH | Ignore
When I run this query:
SELECT COUNT(DISTINCT CUSTOMERID) AS Customers
FROM INTERACTIONS;
I get 7440.
When I group by Channel and then sum the counts across all groups:
SELECT SUM(CUSTOMERS)
FROM
(SELECT
CHANNEL,
COUNT(DISTINCT CUSTOMERID) AS Customers
FROM
INTERACTIONS
GROUP BY
CHANNEL);
I get 9993.
Why? What's wrong? I expected the total number of customers to be the same.
Answer 1:
It is right there in your sample data. The distinct customers are:
245, 247, 249
When you group by channel, customer 245 appears separately for PUSH and SMS:
SMS | 245, 247
PUSH | 245, 249
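Thus the sum of COUNT(DISTINCT x) over the groups of GROUP BY y can be greater than COUNT(DISTINCT x) without GROUP BY, because a value of x that occurs in several groups is counted once per group.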
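The effect can be reproduced with just the four sample rows from the question. The following is a minimal sketch, not the asker's actual schema: the table name INTERACTIONS_DEMO and the column types are assumptions made for illustration, and the derived-table alias per_channel is added because some databases require one.

```sql
-- Hypothetical demo table holding the four sample rows from the question.
CREATE TABLE INTERACTIONS_DEMO (
    CustomerID INT,
    Channel    VARCHAR(10),
    Response   VARCHAR(10)
);

INSERT INTO INTERACTIONS_DEMO VALUES (245, 'SMS',  'Accept');
INSERT INTO INTERACTIONS_DEMO VALUES (245, 'PUSH', 'Ignore');
INSERT INTO INTERACTIONS_DEMO VALUES (247, 'SMS',  'Accept');
INSERT INTO INTERACTIONS_DEMO VALUES (249, 'PUSH', 'Ignore');

-- Distinct customers overall: 245, 247, 249 -> 3
SELECT COUNT(DISTINCT CustomerID) AS Customers
FROM INTERACTIONS_DEMO;

-- Distinct customers per channel, then summed:
-- SMS has 245 and 247 (2), PUSH has 245 and 249 (2) -> 4
SELECT SUM(Customers) AS SumOfPerChannelCounts
FROM (SELECT Channel,
             COUNT(DISTINCT CustomerID) AS Customers
      FROM INTERACTIONS_DEMO
      GROUP BY Channel) per_channel;
```

Customer 245 is counted once in the first query and twice in the second, which accounts for the difference of 1.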
Answer 2:
SELECT CHANNEL,
COUNT(DISTINCT CUSTOMERID) AS Customers
FROM INTERACTIONS
GROUP BY CHANNEL
That query gives you the number of distinct CUSTOMERID values per Channel. The same CUSTOMERID can occur in more than one Channel, in which case it is counted once for each Channel it occurs in, and all of those counts end up in the final sum (9993).
You can check this with the following query, which lists every CUSTOMERID that occurs in more than one Channel together with its number of Channels:
SELECT CUSTOMERID,
COUNT(DISTINCT CHANNEL) AS Channels
FROM INTERACTIONS
GROUP BY CUSTOMERID
HAVING COUNT(DISTINCT CHANNEL) > 1
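The gap between the two results in the question is 9993 - 7440 = 2553, and it should equal the number of extra (customer, channel) pairs contributed by customers who occur in more than one Channel. A sketch of a query that checks this against the question's INTERACTIONS table is shown below. It assumes CUSTOMERID and CHANNEL contain no NULLs; the alias per_customer and the output column names are illustrative.

```sql
-- For each customer, count the distinct channels it occurs in, then aggregate.
-- SUM(Channels)            = sum of the per-channel distinct customer counts
-- COUNT(*)                 = number of distinct customers
-- SUM(Channels) - COUNT(*) = how many times customers were counted more than once
SELECT SUM(Channels)            AS SumOfPerChannelCounts,
       COUNT(*)                 AS DistinctCustomers,
       SUM(Channels) - COUNT(*) AS ExtraCounts
FROM (SELECT CUSTOMERID,
             COUNT(DISTINCT CHANNEL) AS Channels
      FROM INTERACTIONS
      GROUP BY CUSTOMERID) per_customer;
```

If the assumption about NULLs holds, the three columns should correspond to the 9993, the 7440, and their difference from the question.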
Answer 3:
You get a different result because the two channels, PUSH and SMS, both contain the same id, 245. In the first query, COUNT(DISTINCT CUSTOMERID) counts id 245 once. When you group by CHANNEL, the count is computed per group, so in the second query id 245 contributes 1 to PUSH and 1 to SMS. The final SUM() then counts it twice, which is why the result differs.
Source: https://stackoverflow.com/questions/53556396/why-i-got-incorrect-calculation-of-count-distinct-with-group-by