How to avoid merging high cardinality sub-select aggregations on distributed tables

Submitted by 与世无争的帅哥 on 2020-01-23 02:44:04

Question


In Clickhouse, I have a large table A with following columns:

date, user_id, operator, active

In table A, events are already pre-aggregated over date, user_id and operator, while the column 'active' indicates the presence of a certain kind of user activity on the given date.

Table A is distributed over 2 shards/servers: first I created the table A_local on each server (PK is date, user_id), then I created the distributed table A that merges the local tables A_local, using hash(user_id, operator) as the sharding key. user_id is a high-cardinality field (tens to hundreds of millions of values), while the column 'operator' has low cardinality (around 1000 distinct values). Every user_id belongs to a single operator, which means the tuple (user_id, operator) has the same cardinality as user_id itself.
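As a concrete illustration, the setup described above might look like the following DDL. This is a sketch only: the cluster name my_cluster, the database name default, and the MergeTree settings are assumptions not given in the question.

```sql
-- Hypothetical local table on each shard, matching the description above.
CREATE TABLE A_local
(
    date     Date,
    user_id  UInt64,
    operator String,
    active   UInt8
)
ENGINE = MergeTree()
ORDER BY (date, user_id);

-- Distributed table routing rows by a hash of (user_id, operator),
-- so all rows for a given user land on exactly one shard.
CREATE TABLE A AS A_local
ENGINE = Distributed(my_cluster, default, A_local, cityHash64(user_id, operator));
```

Because every user_id maps to a single operator, sharding by cityHash64(user_id, operator) partitions users (and therefore the per-user aggregation in the subselect) cleanly across shards.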

I need to compute the number of users per operator who have been active more than N days in a given period. To achieve that, I first find, for each user_id, the number of days the user was active in the period, which I do in a subselect. Then, in the main select, I count users grouped by operator.

SELECT
    operator,
    count() AS cnt_user
FROM
(
    SELECT
        user_id,
        operator,
        count() AS cnt
    FROM A
    WHERE date >= '2019-06-01' AND date <= '2019-08-31'
    AND active = 1
    GROUP BY
        user_id,
        operator
    HAVING cnt >= 30
)
GROUP BY operator

The idea of sharding by user_id and operator is to route users to different shards. That way, I was hoping the complete query (select and subselect) could run independently on each shard/server, and the final aggregation would then be performed over a low-cardinality set: operator -> count.

However, when I run this query over a large period of time (several months), Clickhouse throws an exception saying that the maximum query memory allocation was exceeded. If I run the same query on the local table, there is no such exception and results are returned: Clickhouse first merges all records from the subselect across both shards, and only then computes the outer aggregation.

The question is how to rewrite the query and/or change the schema so that Clickhouse performs both aggregations locally and merges only the low-cardinality aggregates (over operator) in the last step. I hoped that having the sharding key over user_id and operator would make Clickhouse do that naturally, but it seems that is not the case.


Answer 1:


On each shard, create a view over the local table:

CREATE VIEW xxx AS
SELECT
    user_id,
    operator,
    count() AS cnt
FROM A_local
GROUP BY
    user_id,
    operator
HAVING cnt >= 30

Then create a distributed table on top of the view (the original answer leaves the cluster and database names blank):

CREATE TABLE xxx_d AS xxx
ENGINE = Distributed(<cluster>, <database>, xxx)

Finally, query the distributed view:

SELECT
    operator,
    count() AS cnt_user
FROM xxx_d
WHERE date >= '2019-06-01' AND date <= '2019-08-31'
    AND active = 1
GROUP BY operator
SETTINGS distributed_group_by_no_merge = 1

The WHERE conditions on date and active are pushed down into the view's inner query (ClickHouse predicate pushdown, controlled by enable_optimize_predicate_expression), so they filter A_local before the per-user GROUP BY runs on each shard. With distributed_group_by_no_merge = 1, the high-cardinality per-user aggregation never leaves the shards; only the small per-operator results are sent to the initiator.
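One detail worth noting: because users of the same operator are spread across shards, distributed_group_by_no_merge = 1 makes each shard return its own row for every operator, so the initiator still has to combine the per-shard partial counts. Below is a sketch of how the query could be wrapped to produce final counts; the column name cnt_user follows the question's query, and this wrapping is an assumption, not part of the original answer.

```sql
-- Outer query merges the small per-shard results (at most ~1000 operators
-- per shard), so very little data crosses the network.
SELECT
    operator,
    sum(cnt_user) AS cnt_user      -- combine partial counts from each shard
FROM
(
    SELECT
        operator,
        count() AS cnt_user        -- runs entirely on each shard
    FROM xxx_d
    WHERE date >= '2019-06-01' AND date <= '2019-08-31'
      AND active = 1
    GROUP BY operator
)
GROUP BY operator
SETTINGS distributed_group_by_no_merge = 1
```

Memory usage stays low on the initiator because only the low-cardinality (operator, count) pairs are merged in the final step, which is exactly the behavior the question was asking for.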


Source: https://stackoverflow.com/questions/57825939/how-to-avoid-merging-high-cardinality-sub-select-aggregations-on-distributed-tab
