Get apps with the highest review count since a dynamic series of days

三世轮回 提交于 2021-01-28 21:41:12

问题


I have two tables, apps and reviews (simplified for the sake of discussion):

apps table

id int

reviews table

id          int
review_date date
app_id      int (foreign key that points to apps)

2 questions:

1. How can I write a query / function to answer the following question?:

Given a series of dates from the earliest reviews.review_date to the latest reviews.review_date (incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D?

I think I know how to write a query if given an explicit date:

SELECT
  apps.id,
  count(reviews.*)
FROM
  reviews
  INNER JOIN apps ON apps.id = reviews.app_id
group by
  1
having
  min(reviews.review_date) >= '2020-01-01'
  order by 2 desc
limit 10;

But I don't know how to query this dynamically given the desired date series and compile all this information in a single view.

2. What's the best way to model this data?

It would be nice to have the # of reviews at the time for each date as well as the app_id. As of now I'm thinking something that might look like:

... 2020-01-01_app_id | 2020-01-01_review_count | 2020-01-02_app_id | 2020-01-02_review_count ...

But I'm wondering if there's a better way to do this. Stitching the data together also seems like a challenge.


回答1:


I think this is what you are looking for:

Postgres 13 or newer

WITH cte AS (  -- MATERIALIZED
   SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
   FROM   reviews
   GROUP  BY 1
   )
SELECT *
FROM  (
   SELECT generate_series(min(review_date)
                        , max(review_date)
                        , '1 day')::date
   FROM   reviews
   ) d(review_window_start)
LEFT  JOIN LATERAL (
   SELECT total_ct, array_agg(app_id) AS apps
   FROM  (
      SELECT app_id, total_ct
      FROM   cte c
      WHERE  c.earliest_review >= d.review_window_start
      ORDER  BY total_ct DESC
      FETCH  FIRST 1 ROWS WITH TIES  -- new & hot
      ) sub
   GROUP  BY 1
   ) a ON true;

WITH TIES makes it a bit cheaper. Added in Postgres 13 (currently beta). See:

  • Greater than or equal to ALL() and equal to MAX() speed

Postgres 12 or older

WITH cte AS (  -- MATERIALIZED
   SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
   FROM   reviews
   GROUP  BY 1
   )
SELECT *
FROM  (
   SELECT generate_series(min(review_date)
                        , max(review_date)
                        , '1 day')::date
   FROM   reviews
   ) d(review_window_start)
LEFT  JOIN LATERAL (
   SELECT total_ct, array_agg(app_id) AS apps
   FROM  (
      SELECT total_ct, app_id
          ,  rank() OVER (ORDER BY total_ct DESC) AS rnk
      FROM   cte c
      WHERE  c.earliest_review >= d.review_window_start
      ) sub
   WHERE  rnk = 1
   GROUP  BY 1
   ) a ON true;

db<>fiddle here

Same as above, but without WITH TIES.

We don't need to involve the table apps at all. The table reviews has all information we need.

The CTE cte computes earliest review & current total count per app. The CTE avoids repeated computation. Should help quite a bit.
It is always materialized before Postgres 12, and should be materialized automatically in Postgres 12 since it is used many times in the main query. Else you could add the keyword MATERIALIZED in Postgres 12 or later to force it. See:

  • How to force evaluation of subquery before joining / pushing down to foreign server

The optimized generate_series() call produces the series of days from earliest to latest review. See:

  • Generating time series between two dates in PostgreSQL
  • Join a count query on a generate_series in postgres and also retrieve Null-values as "0"

Finally, the LEFT JOIN LATERAL you already discovered. But since multiple apps can tie for the most reviews, retrieve all winners, which can be 0 - n apps. The query aggregates all daily winners into an array, so we get a single result row per review_window_start. Alternatively, define tiebreaker(s) to get at most one winner. See:

  • What is the difference between LATERAL and a subquery in PostgreSQL?



回答2:


If you are looking for hints, then here are a few:

  1. Are you aware of generate_series() and how to use it to compose a table of dates given a start and end date? If not, then there are plenty of examples on this site.
  2. To answer this question for any given date, you need to have only two measures for each app, and only one of these is used to compare an app against other apps. Your query in part 1 shows that you know what these two measures are.
  3. Hints 1 and 2 should be enough to get this done. The only thing I can add is for you not to worry about making the database do "too much work." That is what it is there to do. If it does not do it quickly enough, then you can think about optimizations, but before you get to that step, concentrate on getting the answer that you want.

Please comment if you need further clarification on this.




回答3:


The missing piece for me was lateral join. I can accomplish just about what I want using the following:


select
  review_windows.review_window_start,
  id,
  review_total,
  earliest_review
from
  (
    select
      date_trunc('day', review_windows.review_windows) :: date as review_window_start
    from
      generate_series(
        (
          SELECT
            min(reviews.review_date)
          FROM
            reviews
        ),
        (
          SELECT
            max(reviews.review_date)
          FROM
            reviews
        ),
        '1 year'
      ) review_windows
    order by
      1 desc
  ) review_windows
  left join lateral (
    SELECT
      apps.id,
      count(reviews.*) as review_total,
      min(reviews.review_date) as earliest_review
    FROM
      reviews
      INNER JOIN apps ON apps.id = reviews.app_id
    where
      reviews.review_date >= review_windows.review_window_start
    group by
      1
    having
      min(reviews.review_date) >= review_windows.review_window_start
    order by
      2 desc,
      3 desc
    limit
      2
  ) apps_most_reviews on true;


来源:https://stackoverflow.com/questions/63310905/get-apps-with-the-highest-review-count-since-a-dynamic-series-of-days

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!