ClickHouse: topK by uniqs or sum of other column

Posted by 生来就可爱ヽ(ⅴ<●) on 2020-01-24 21:49:25

Question


We're storing sessions in ClickHouse. Each row has (among others) a city, a duration, an IP and an agent column. In one aggregation we group by page and calculate the sum of the durations and the number of unique IP/agent combinations. We also aggregate the top 5 cities, but the cities are ranked by their number of occurrences in the database before the top 5 are selected. Is it possible to rank the cities by unique visitors (as indicated by the agent/IP combination) or by the sum of durations instead?

EDIT (adding a specific query and more explanation):

          SELECT page, day,
            CAST(uniqExact(ip, agent) AS UInt16) AS uniqs,
            topKIf(5)(city, city <> '') AS top_cities,
            sum(duration) AS total_duration
          FROM pageviews
          WHERE day = toDate('2019-12-24')
          GROUP BY page, day

So top_cities is determined by the number of pageviews for a given city. I'd like it to be determined instead by sum(duration) per city or by the number of unique ip/agent combinations per city.

I'm aware that I could GROUP BY page, city, ip, agent and do the final aggregation in an additional step, but this just takes too long for the data set.


Answer 1:


topK returns an array of the most frequent values, so it cannot rank values by another column.

It looks like you need a more direct approach, like this:

SELECT
    page,
    groupArray((city, metric)) AS cityMetricArray,

    /* Assign each city a unique numeric ID.
       If your dataset already contains a CityId, use it instead of this artificial key. */
    arrayMap((x, id) -> (x.1, x.2, id), cityMetricArray, arrayEnumerateDense(arrayMap(x -> (x.1), cityMetricArray))) AS cityMetricCityIdArray,

    /* Calculate the sum of metrics for each city.
       Unfortunately, the sumMap function accepted only a numeric array as the key array; otherwise, passing the array of city names as keys would make the code simpler. */
    arrayReduce('sumMap', [arrayMap(x -> x.3, cityMetricCityIdArray)], [arrayMap(x -> x.2, cityMetricCityIdArray)]) AS cityMetricSumArray,

    /* Take the IDs of the top 5 cities. */
    arrayReverseSort((cityId, sumMetric) -> sumMetric, cityMetricSumArray.1, cityMetricSumArray.2) AS cityIds,
    arraySlice(cityIds, 1, 5) AS topNCityIds,

    /* Map cityIds to city names. */
    arrayMap(cityId -> arrayFirst(x -> x.3 = cityId, cityMetricCityIdArray).1, topNCityIds) AS topCities
FROM
(   /* test data */
    SELECT
        data.1 AS city,
        data.2 AS metric,
        'page' AS page
    FROM
    (
        SELECT arrayJoin([
          ('city1', 11), ('city2', 11), ('city3', 11), 
          ('city4', 11), ('city2', 11), ('city4', 22), 
          ('city5', 5), ('city6', 22), ('city7', 10)]) AS data
    )
)
GROUP BY page
FORMAT Vertical

/* Result:
page:                  page
cityMetricArray:       [('city1',11),('city2',11),('city3',11),('city4',11),('city2',11),('city4',22),('city5',5),('city6',22),('city7',10)]
cityMetricCityIdArray: [('city1',11,1),('city2',11,2),('city3',11,3),('city4',11,4),('city2',11,2),('city4',22,4),('city5',5,5),('city6',22,6),('city7',10,7)]
cityMetricSumArray:    ([1,2,3,4,5,6,7],[11,22,11,33,5,22,10])
cityIds:               [4,2,6,1,3,7,5]
topNCityIds:           [4,2,6,1,3]
topCities:             ['city4','city2','city6','city1','city3']
*/
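
If an inner grouping by page and city is affordable for your data (an assumption; it is coarser than the GROUP BY page, city, ip, agent that the question rules out as too slow), a two-level aggregation may be easier to read than the array manipulation above. The following is a minimal sketch on the same test data, not a benchmarked solution; the aliases cityMetric and topCities are only illustrative.

```sql
SELECT
    page,
    /* Sort the (city, cityMetric) pairs by cityMetric in descending order,
       keep only the city names, and take the first 5. */
    arraySlice(
        arrayMap(t -> t.1,
            arrayReverseSort(t -> t.2, groupArray((city, cityMetric)))),
        1, 5) AS topCities
FROM
(   /* One row per (page, city) with the summed metric. */
    SELECT
        page,
        city,
        sum(metric) AS cityMetric
    FROM
    (   /* test data, same as in the query above */
        SELECT
            data.1 AS city,
            data.2 AS metric,
            'page' AS page
        FROM
        (
            SELECT arrayJoin([
              ('city1', 11), ('city2', 11), ('city3', 11),
              ('city4', 11), ('city2', 11), ('city4', 22),
              ('city5', 5), ('city6', 22), ('city7', 10)]) AS data
        )
    )
    GROUP BY page, city
)
GROUP BY page
FORMAT Vertical
```

The relative order of cities with equal sums (city2 and city6 here) is not guaranteed. Against the real table, the inner query would presumably read from pageviews and use sum(duration) in place of sum(metric). To rank by unique visitors instead, uniqExact(ip, agent) should slot into the same place, but the per-page uniqs can then no longer be obtained by summing the per-city values, because one visitor may appear in several cities.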


Source: https://stackoverflow.com/questions/59178819/clickhouse-topk-by-uniqs-or-sum-of-other-column
