Aggregate function to detect trend in PostgreSQL

Question

I'm using a PostgreSQL database to store a data structure like this:

datapoint(userId, rank, timestamp)

where timestamp is a Unix epoch timestamp in milliseconds.

In this structure I store the rank of each user each day, so it's like:

UserId   Rank  Timestamp
1        1     1435366459
1        2     1435366458
1        3     1435366457
2        8     1435366456
2        6     1435366455
2        7     1435366454

So, in the sample data above, userId 1 is improving its rank with each measurement, which means it has a positive trend, while userId 2 is dropping in rank, which means it has a negative trend.

What I need to do is to detect all users that have a positive trend based on the last N measurements.

Answer 1:

One approach would be to perform a linear regression on each user's rank and check whether the slope is positive or negative. Luckily, PostgreSQL has a built-in aggregate function to do that, regr_slope:

SELECT   user_id, regr_slope (rank1, timestamp1) AS slope
FROM     my_table
GROUP BY user_id

This query gives you the basic functionality. Now, you can dress it up a bit with case expressions if you like:

SELECT user_id, 
       CASE WHEN slope > 0 THEN 'positive' 
            WHEN slope < 0 THEN 'negative' 
            ELSE 'steady' END AS trend
FROM   (SELECT   user_id, regr_slope (rank1, timestamp1) AS slope
        FROM     my_table
        GROUP BY user_id) t
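
To check how the sign of the slope lines up with the question's notion of a trend, the sample data can be run through the same aggregate. This is only a sketch: the inline my_table CTE and the column names user_id, rank1 and timestamp1 are assumptions that mirror the naming used in this answer, not the asker's real schema.

```sql
-- Hypothetical inline copy of the question's sample data,
-- using the column names assumed in this answer.
WITH my_table (user_id, rank1, timestamp1) AS (
    VALUES (1, 1, 1435366459),
           (1, 2, 1435366458),
           (1, 3, 1435366457),
           (2, 8, 1435366456),
           (2, 6, 1435366455),
           (2, 7, 1435366454)
)
-- regr_slope(Y, X): the dependent variable (rank) comes first.
SELECT   user_id, regr_slope(rank1, timestamp1) AS slope
FROM     my_table
GROUP BY user_id
ORDER BY user_id;
```

Worked by hand, the least-squares slope is -1 for user 1 and 0.5 for user 2. Because rank 1 is the best rank, a negative slope here corresponds to what the question calls a positive trend, so the comparison may need to be flipped depending on which meaning is wanted.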

Edit:
Unfortunately, regr_slope doesn't have a built-in way to handle "top N" type requirements, so this has to be handled separately, e.g., with a subquery that uses row_number:

-- Decoration outer query
SELECT user_id, 
       CASE WHEN slope > 0 THEN 'positive' 
            WHEN slope < 0 THEN 'negative' 
            ELSE 'steady' END AS trend
FROM   (-- Inner query to calculate the slope
        SELECT   user_id, regr_slope (rank1, timestamp1) AS slope
        FROM     (-- Inner query to get top N
                  SELECT user_id, rank1, timestamp1,
                         ROW_NUMBER() OVER (PARTITION BY user_id 
                                            ORDER BY timestamp1 DESC) AS rn
                  FROM   my_table) t
        WHERE    rn <= N -- Replace N with the number of rows you need
        GROUP BY user_id) t2
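
If only the list of improving users is needed, as the question asks, the slope test could be moved into a HAVING clause. The following is a sketch under assumptions: N is fixed at 3, the inline my_table CTE stands in for the real table, and "improving" is taken to mean a negative slope because rank 1 is the best rank in the question's sample.

```sql
-- Hypothetical inline copy of the question's sample data.
WITH my_table (user_id, rank1, timestamp1) AS (
    VALUES (1, 1, 1435366459),
           (1, 2, 1435366458),
           (1, 3, 1435366457),
           (2, 8, 1435366456),
           (2, 6, 1435366455),
           (2, 7, 1435366454)
),
latest AS (
    -- Number each user's measurements from newest to oldest.
    SELECT user_id, rank1, timestamp1,
           ROW_NUMBER() OVER (PARTITION BY user_id
                              ORDER BY timestamp1 DESC) AS rn
    FROM   my_table
)
SELECT   user_id
FROM     latest
WHERE    rn <= 3                               -- assumed N = 3
GROUP BY user_id
HAVING   regr_slope(rank1, timestamp1) < 0;    -- rank number falling = improving
```

With the sample data, only user 1 should be returned. Users with fewer than two measurements get a NULL slope and are filtered out by the HAVING condition.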

Answer 2:

You can use analytic functions for this. Overall approach:

compute the previous rank using lag()
use a case expression to decide whether the step from the previous rank is positive or not (1 or 0)
use min() to get the minimum of those flags over the preceding N rows; if the trend was positive for all of those rows, this returns 1, otherwise 0. To limit it to N rows, use the rows between N preceding and 0 following frame clause of the window function (this frame spans the current row plus the N rows before it)

Code:

select v2.*,
  min(positive_trend) over (partition by userid order by timestamp1
                             rows between 3 preceding and 0 following) as trend_overall
from (
  select v1.*,
    (case when prev_rank < rank1 then 0 else 1 end) as positive_trend
  from (
    select userid,
      rank1,
      timestamp1,
      lag(rank1) over (partition by userid order by timestamp1) as prev_rank
    from t1
    order by userid, timestamp1
  ) v1
) v2
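
To try the approach on the question's sample data, the same steps can be written against an inline table. This is a sketch, not a replacement for the query above: the t1 CTE is a hypothetical stand-in for the real table, and the frame is shortened to the current row plus the 2 rows before it, since the sample has only three rows per user.

```sql
-- Hypothetical inline copy of the question's sample data.
WITH t1 (userid, rank1, timestamp1) AS (
    VALUES (1, 1, 1435366459),
           (1, 2, 1435366458),
           (1, 3, 1435366457),
           (2, 8, 1435366456),
           (2, 6, 1435366455),
           (2, 7, 1435366454)
),
flagged AS (
    -- 0 when the rank number grew (got worse) since the previous row, else 1.
    -- The first row per user has no previous rank, so it falls through to 1.
    SELECT userid, rank1, timestamp1,
           CASE WHEN lag(rank1) OVER w < rank1 THEN 0 ELSE 1 END AS positive_trend
    FROM   t1
    WINDOW w AS (PARTITION BY userid ORDER BY timestamp1)
)
SELECT   userid, rank1, timestamp1, positive_trend,
         min(positive_trend) OVER (PARTITION BY userid
                                   ORDER BY timestamp1
                                   ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS trend_overall
FROM     flagged
ORDER BY userid, timestamp1;
```

On this data, every row for user 1 should have trend_overall = 1, while the latest row for user 2 should have trend_overall = 0, because its rank went from 6 to 8 in the last step. CURRENT ROW is used here in place of 0 following; in ROWS mode the two describe the same frame end.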


UPDATE

To get only each userid with its overall trend and the delta of the rank, add another call to lag(.., N+1) to fetch the rank from N+1 rows earlier, and row_number() to number the rows within each userid:

select v3.userid, v3.trend_overall, delta_rank
from (  
  select v2.*,
    min(positive_trend) over (partition by userid order by timestamp1
                               rows between 3 preceding and 0 following) as trend_overall,
    latest_rank - prev_N_rank as delta_rank
  from (
    select v1.*,
      (case when prev_rank < rank1 then 0 else 1 end) as positive_trend,
      max(case when v1.rn = 1 then rank1 else NULL end) over (partition by userid) as latest_rank
    from (
      select userid,
        rank1,
        timestamp1,
        lag(rank1) over (partition by userid order by timestamp1) as prev_rank,
        lag(rank1, 4) over (partition by userid order by timestamp1) as prev_N_rank,
        row_number() over (partition by userid order by timestamp1 desc) as rn
      from t1
      order by userid, timestamp1
    ) v1
  ) v2
) v3 
where rn = 1
group by userid, trend_overall, delta_rank
order by userid, trend_overall, delta_rank


Source: https://stackoverflow.com/questions/22039054/aggregate-function-to-detect-trend-in-postgresql

Tags

sql

database

postgresql

select

aggregate-functions