Question
Suppose I have a given time range. For the sake of explanation, let's consider something simple, like the whole year 2018. I want to query data from ClickHouse as a sum aggregation for each quarter, so the result should be 4 rows.
The problem is that I only have data for two quarters, so when using GROUP BY quarter, only two rows are returned.
SELECT
    toStartOfQuarter(created_at) AS time,
    sum(metric) AS metric
FROM mytable
WHERE
    created_at >= toDate(1514761200) AND created_at >= toDateTime(1514761200)
    AND created_at <= toDate(1546210800) AND created_at <= toDateTime(1546210800)
GROUP BY time
ORDER BY time
(Here 1514761200 corresponds to 2018-01-01 and 1546210800 to 2018-12-31.)
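If desired, a quick way to double-check these boundaries is to ask ClickHouse itself; this is only a sanity-check sketch, and the calendar dates it prints depend on the server's timezone setting:

-- Sanity check: which calendar instants the two epoch boundaries map to.
-- The displayed values depend on the server timezone.
SELECT
    toDateTime(1514761200) AS range_start,
    toDateTime(1546210800) AS range_end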
The query above returns:
time        metric
2018-01-01  345
2018-04-01  123
And I need:
time        metric
2018-01-01  345
2018-04-01  123
2018-07-01  0
2018-10-01  0
This is a simplified example, but in the real use case the aggregation would be e.g. 5 minutes instead of quarters, and the GROUP BY would have at least one more attribute, like GROUP BY attribute1, time, so the desired result is:
time        metric  attribute1
2018-01-01  345     1
2018-01-01  345     2
2018-04-01  123     1
2018-04-01  123     2
2018-07-01  0       1
2018-07-01  0       2
2018-10-01  0       1
2018-10-01  0       2
Is there a way to somehow fill the whole given interval? Something like InfluxDB's fill argument for GROUP BY, or TimescaleDB's time_bucket() function combined with generate_series(). I searched the ClickHouse documentation and GitHub issues, and it seems this is not implemented yet, so the question is whether there is any workaround.
Answer 1:
You can generate the zero values using the numbers() table function, combine them with your query using UNION ALL, and then apply the GROUP BY to the combined data.
So your query will look like:
SELECT
    SUM(metric) AS metric,
    time
FROM
(
    SELECT
        toStartOfQuarter(toDate(1514761200 + number * 30 * 24 * 3600)) AS time,
        toUInt16(0) AS metric
    FROM numbers(30)
    UNION ALL
    SELECT
        toStartOfQuarter(created_at) AS time,
        metric
    FROM mytable
    WHERE created_at >= toDate(1514761200)
      AND created_at >= toDateTime(1514761200)
      AND created_at <= toDate(1546210800)
      AND created_at <= toDateTime(1546210800)
)
GROUP BY time
ORDER BY time
Note the toUInt16(0): the zero values must have the same type as the metric column.
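A slightly tighter variant of the same idea steps by whole quarters, so the generated zero rows stay inside 2018. This is only a sketch: table and column names come from the question, and it assumes ClickHouse's addQuarters() date function is available:

SELECT
    SUM(metric) AS metric,
    time
FROM
(
    -- one zero row per quarter of 2018
    SELECT
        addQuarters(toDate('2018-01-01'), number) AS time,
        toUInt16(0) AS metric
    FROM numbers(4)
    UNION ALL
    SELECT
        toStartOfQuarter(created_at) AS time,
        metric
    FROM mytable
    WHERE created_at >= toDate('2018-01-01')
      AND created_at <= toDate('2018-12-31')
)
GROUP BY time
ORDER BY time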
Answer 2:
As an alternative to the numbers() function, in some cases the range and array functions can be useful.
Example: for each pair of (id1, id2), the dates of the previous 7 days should be generated.
SELECT
    id1,
    id2,
    arrayJoin(
        arrayMap(x -> today() - 7 + x, range(7))
    ) AS date2
FROM table
WHERE date >= today() - 7
GROUP BY id1, id2
The result of that SELECT can then be used in a UNION ALL to fill the 'holes' in the data.
SELECT id1, id2, date, sum(column1)
FROM
(
    SELECT
        id1,
        id2,
        date,
        column1
    FROM table
    WHERE date >= today() - 7
    UNION ALL
    -- UNION ALL matches columns by position, so the date2 alias below
    -- still lines up with the date column of the first branch
    SELECT
        id1,
        id2,
        arrayJoin(
            arrayMap(x -> today() - 7 + x, range(7))
        ) AS date2,
        0 AS column1
    FROM table
    WHERE date >= today() - 7
    GROUP BY id1, id2
)
GROUP BY id1, id2, date
ORDER BY date, id1, id2
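Adapted to the shape of the original question (one row per attribute1 per quarter of 2018), the same arrayJoin trick could look roughly like the sketch below. It reuses the question's mytable, attribute1, created_at and metric names and assumes the addQuarters() date function:

SELECT attribute1, time, sum(metric) AS metric
FROM
(
    -- real data, bucketed by quarter
    SELECT
        attribute1,
        toStartOfQuarter(created_at) AS time,
        metric
    FROM mytable
    WHERE created_at >= toDate('2018-01-01')
      AND created_at <= toDate('2018-12-31')
    UNION ALL
    -- zero rows: one per (attribute1, quarter) combination
    SELECT
        attribute1,
        arrayJoin(arrayMap(x -> addQuarters(toDate('2018-01-01'), x), range(4))) AS time,
        0 AS metric
    FROM (SELECT DISTINCT attribute1 FROM mytable)
)
GROUP BY attribute1, time
ORDER BY time, attribute1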
Answer 3:
Here is how I did it for hour buckets (I needed to visualize this in Grafana, so $from and $to below are the dashboard's time-range variables), thanks to @filimonov and @mikhail:
SELECT t, SUM(metric) AS metric
FROM
(
    SELECT
        arrayJoin(
            arrayMap(
                x -> toStartOfHour(addHours(toDateTime($from), x)),
                range(toUInt64(dateDiff('hour', toDateTime($from), toDateTime($to)) + 1))
            )
        ) AS t,
        0 AS metric
    UNION ALL
    SELECT
        toStartOfHour(my_date) AS t,
        COUNT(metric) AS metric
    FROM my_table
    WHERE t BETWEEN toDateTime($from) AND toDateTime($to)
    GROUP BY t
)
GROUP BY t
ORDER BY t
So, for example, for the range from 2019-01-01 to 2019-01-02 it will give you:
SELECT t, SUM(metric) AS metric
FROM
(
    SELECT
        arrayJoin(
            arrayMap(
                x -> toStartOfHour(addHours(toDateTime('2019-01-01 00:00:00'), x)),
                range(toUInt64(dateDiff('hour', toDateTime('2019-01-01 00:00:00'), toDateTime('2019-01-02 00:00:00')) + 1))
            )
        ) AS t,
        0 AS metric
    UNION ALL
    SELECT
        toStartOfHour(my_date) AS t,
        COUNT(1) AS metric
    FROM my_table
    WHERE t BETWEEN toDateTime('2019-01-01 00:00:00') AND toDateTime('2019-01-02 00:00:00')
    GROUP BY t
)
GROUP BY t
ORDER BY t;
t |metric|
-------------------|------|
2019-01-01 00:00:00| 0|
2019-01-01 01:00:00| 0|
2019-01-01 02:00:00| 0|
2019-01-01 03:00:00| 0|
2019-01-01 04:00:00| 0|
2019-01-01 05:00:00| 0|
2019-01-01 06:00:00| 0|
2019-01-01 07:00:00|105702|
2019-01-01 08:00:00|113315|
2019-01-01 09:00:00|149837|
2019-01-01 10:00:00|185314|
2019-01-01 11:00:00|246106|
2019-01-01 12:00:00|323036|
2019-01-01 13:00:00| 0|
2019-01-01 14:00:00|409160|
2019-01-01 15:00:00|379113|
2019-01-01 16:00:00|256634|
2019-01-01 17:00:00|286601|
2019-01-01 18:00:00|280039|
2019-01-01 19:00:00|248504|
2019-01-01 20:00:00|218642|
2019-01-01 21:00:00|186152|
2019-01-01 22:00:00|148478|
2019-01-01 23:00:00|109721|
2019-01-02 00:00:00| 0|
Source: https://stackoverflow.com/questions/50238568/how-to-group-by-time-bucket-in-clickhouse-and-fill-missing-data-with-nulls-0s