Question
I'm using a PostgreSQL database to store a data structure like this:
datapoint(userId, rank, timestamp)
where timestamp is the Unix Epoch milliseconds timestamp.
In this structure I store the rank of each user each day, so it's like:
UserId  Rank  Timestamp
1       1     1435366459
1       2     1435366458
1       3     1435366457
2       8     1435366456
2       6     1435366455
2       7     1435366454
So, in the sample data above, userId 1 is improving its rank with each measurement, which means it has a positive trend, while userId 2 is dropping in rank, which means it has a negative trend.
What I need to do is to detect all users that have a positive trend based on the last N measurements.
Answer 1:
One approach would be to perform a linear regression on each user's rank and check whether the slope is positive or negative. Luckily, PostgreSQL has a built-in function to do that, regr_slope:
SELECT user_id, regr_slope (rank1, timestamp1) AS slope
FROM my_table
GROUP BY user_id
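To make concrete what the slope measures, here is a minimal Python sketch of the least-squares slope that regr_slope(Y, X) is documented to return, applied to the question's sample rows. The helper name and the grouping code are illustrative assumptions, not part of PostgreSQL.

```python
from collections import defaultdict

def regr_slope(ys, xs):
    """Least-squares slope of y on x; None when it is undefined."""
    n = len(xs)
    if n == 0:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all x values identical: no slope
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# (user_id, rank, timestamp) rows from the question's sample data
rows = [
    (1, 1, 1435366459), (1, 2, 1435366458), (1, 3, 1435366457),
    (2, 8, 1435366456), (2, 6, 1435366455), (2, 7, 1435366454),
]

# Equivalent of GROUP BY user_id
by_user = defaultdict(list)
for user_id, rank, ts in rows:
    by_user[user_id].append((ts, rank))

slopes = {
    user_id: regr_slope([r for _, r in points], [t for t, _ in points])
    for user_id, points in by_user.items()
}
print(slopes)
```

On this data user 1 gets a slope of -1 and user 2 a slope of 0.5. The slope follows the rank number, so a user whose rank goes from 3 to 1 has a negative slope. If a lower rank number counts as an improvement, as in the question, the sign has to be read the other way round.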
This query gives you the basic functionality. Now, you can dress it up a bit with case expressions if you like:
SELECT user_id,
       CASE WHEN slope > 0 THEN 'positive'
            WHEN slope < 0 THEN 'negative'
            ELSE 'steady' END AS trend
FROM   (SELECT user_id, regr_slope (rank1, timestamp1) AS slope
        FROM my_table
        GROUP BY user_id) t
Edit:
Unfortunately, regr_slope doesn't have a built-in way to handle "top N" type requirements, so this should be handled separately, e.g., by a subquery with row_number:
-- Decoration outer query
SELECT user_id,
       CASE WHEN slope > 0 THEN 'positive'
            WHEN slope < 0 THEN 'negative'
            ELSE 'steady' END AS trend
FROM   (-- Inner query to calculate the slope
        SELECT user_id, regr_slope (rank1, timestamp1) AS slope
        FROM   (-- Inner query to get top N
                -- timestamp1 must be selected here because regr_slope uses it
                SELECT user_id, rank1, timestamp1,
                       ROW_NUMBER() OVER (PARTITION BY user_id
                                          ORDER BY timestamp1 DESC) AS rn
                FROM my_table) t
        WHERE rn <= N -- Replace N with the number of rows you need
        GROUP BY user_id) t2
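The same "latest N rows per user, then slope, then label" pipeline can be sketched in plain Python to check the logic on small data. The function names and the sample rows are made up for illustration. The labels describe the sign of the slope of the rank number, exactly as the CASE expression does, and an undefined slope falls through to 'steady' like a NULL would.

```python
from collections import defaultdict

def slope(points):
    """Least-squares slope of rank against time for (timestamp, rank) pairs."""
    n = len(points)
    if n == 0:
        return None
    mean_t = sum(t for t, _ in points) / n
    mean_r = sum(r for _, r in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        return None  # a single point (or identical timestamps) has no slope
    return sum((t - mean_t) * (r - mean_r) for t, r in points) / sxx

def trends_last_n(rows, n):
    """rows: (user_id, rank, timestamp). Label each user from its n newest rows."""
    by_user = defaultdict(list)
    for user_id, rank, ts in rows:
        by_user[user_id].append((ts, rank))
    trends = {}
    for user_id, points in by_user.items():
        latest = sorted(points, reverse=True)[:n]  # newest first, like rn <= N
        s = slope(latest)
        if s is not None and s > 0:
            trends[user_id] = 'positive'
        elif s is not None and s < 0:
            trends[user_id] = 'negative'
        else:
            trends[user_id] = 'steady'
    return trends

# Hypothetical data: user 1's rank number rises overall but falls recently
rows = [
    (1, 1, 1), (1, 2, 2), (1, 9, 3), (1, 8, 4), (1, 7, 5),
    (2, 4, 1), (2, 4, 2), (2, 4, 3),
]
print(trends_last_n(rows, 3))
print(trends_last_n(rows, 5))
```

Restricting the input to the latest three rows changes user 1's label from 'positive' to 'negative', which is the effect the rn <= N filter is meant to have.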
Answer 2:
You can use analytic functions for this. Overall approach:
- compute the previous rank using lag()
- use a case expression to flag each row's trend as positive (1) or not (0)
- use min() to get the minimum trend over the current row and the preceding N rows; if the trend was positive for all of those rows, this returns 1, otherwise 0. To limit it to N rows, use the "between N preceding and 0 following" clause of the windowing function
Code:
select v2.*,
       min(positive_trend) over (partition by userid order by timestamp1
                                 rows between 3 preceding and 0 following) as trend_overall
from (
  select v1.*,
         (case when prev_rank < rank1 then 0 else 1 end) as positive_trend
  from (
    select userid,
           rank1,
           timestamp1,
           lag(rank1) over (partition by userid order by timestamp1) as prev_rank
    from t1
    order by userid, timestamp1
  ) v1
) v2
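The three steps (lag, flag, windowed min) can be traced by hand with a small Python sketch. The function name and the tuple layout are assumptions for illustration. As in the SQL, the first row of each user has no previous rank and is flagged 1, because a comparison with NULL falls through to the ELSE branch.

```python
from collections import defaultdict

def rolling_trend(rows, preceding=3):
    """rows: (userid, rank, timestamp).

    Returns {userid: [(timestamp, rank, positive_trend, trend_overall), ...]}
    ordered by timestamp.
    """
    by_user = defaultdict(list)
    for userid, rank, ts in rows:
        by_user[userid].append((ts, rank))
    out = {}
    for userid, points in by_user.items():
        points.sort()  # order by timestamp1
        flags, result, prev_rank = [], [], None
        for i, (ts, rank) in enumerate(points):
            # lag(rank1): 0 only when the rank number went up since the previous row
            positive = 0 if (prev_rank is not None and prev_rank < rank) else 1
            flags.append(positive)
            # rows between `preceding` preceding and the current row
            window = flags[max(0, i - preceding): i + 1]
            result.append((ts, rank, positive, min(window)))
            prev_rank = rank
        out[userid] = result
    return out

# (userid, rank, timestamp) rows from the question's sample data
rows = [
    (1, 1, 1435366459), (1, 2, 1435366458), (1, 3, 1435366457),
    (2, 8, 1435366456), (2, 6, 1435366455), (2, 7, 1435366454),
]
trend = rolling_trend(rows)
print([r[3] for r in trend[1]])
print([r[3] for r in trend[2]])
```

Here a falling rank number counts as positive, which matches the question's definition: user 1 ends with trend_overall 1, while user 2 ends with 0 because its last step goes from rank 6 to rank 8.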
UPDATE
To get only the userid together with the overall trend and the delta of the rank, add another call to lag(.., N+1) to get the Nth previous rank, and row_number() to number the rows within each userid:
select v3.userid, v3.trend_overall, delta_rank
from (
  select v2.*,
         min(positive_trend) over (partition by userid order by timestamp1
                                   rows between 3 preceding and 0 following) as trend_overall,
         latest_rank - prev_N_rank as delta_rank
  from (
    select v1.*,
           (case when prev_rank < rank1 then 0 else 1 end) as positive_trend,
           max(case when v1.rn = 1 then rank1 else NULL end) over (partition by userid) as latest_rank
    from (
      select userid,
             rank1,
             timestamp1,
             lag(rank1) over (partition by userid order by timestamp1) as prev_rank,
             lag(rank1, 4) over (partition by userid order by timestamp1) as prev_N_rank,
             row_number() over (partition by userid order by timestamp1 desc) as rn
      from t1
      order by userid, timestamp1
    ) v1
  ) v2
) v3
where rn = 1
group by userid, trend_overall, delta_rank
order by userid, trend_overall, delta_rank
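As a rough cross-check of what this query returns per user, the following Python sketch computes the same two values for the latest row only. The function name and the sample rows are hypothetical. With preceding=3 it looks back 4 rows for the delta, matching lag(rank1, 4) above, and the delta is None when a user has fewer than 5 measurements, just as the lag would be NULL.

```python
from collections import defaultdict

def latest_trend_and_delta(rows, preceding=3):
    """rows: (userid, rank, timestamp). Returns {userid: (trend_overall, delta_rank)}."""
    by_user = defaultdict(list)
    for userid, rank, ts in rows:
        by_user[userid].append((ts, rank))
    summary = {}
    for userid, points in by_user.items():
        points.sort()  # oldest first
        ranks = [rank for _, rank in points]
        # positive_trend per row; the first row has no predecessor and counts as 1
        flags = [1] + [0 if prev < cur else 1 for prev, cur in zip(ranks, ranks[1:])]
        # window of the latest row: itself plus up to `preceding` earlier rows
        trend_overall = min(flags[-(preceding + 1):])
        back = preceding + 1  # same offset as lag(rank1, 4) when preceding = 3
        prev_n_rank = ranks[-1 - back] if len(ranks) > back else None
        delta = ranks[-1] - prev_n_rank if prev_n_rank is not None else None
        summary[userid] = (trend_overall, delta)
    return summary

# Hypothetical data: (userid, rank, timestamp)
rows = [
    (1, 9, 1), (1, 7, 2), (1, 6, 3), (1, 4, 4), (1, 2, 5),
    (2, 3, 1), (2, 4, 2), (2, 5, 3), (2, 5, 4), (2, 8, 5),
    (3, 5, 1), (3, 4, 2), (3, 3, 3),
]
print(latest_trend_and_delta(rows))
```

A negative delta_rank means the rank number went down over the window, so user 1 (delta -7, trend 1) improved, while user 2 (delta 5, trend 0) dropped.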
Source: https://stackoverflow.com/questions/22039054/aggregate-function-to-detect-trend-in-postgresql