MySQL: group by consecutive days and count groups

前端 未结 5 1735
广开言路
广开言路 2020-12-17 20:43

I have a database table which holds each user\'s checkins in cities. I need to know how many days a user has been in a city, and then, how many visits a user has made to a c

相关标签:
5条回答
  • 2020-12-17 21:28

    As data sample provided by Devart, the inner "PreQuery" works with sql variables. By defaulting the @LUser to a -1 (probable non-existent user ID), the IF() test checks for any difference between last user and current. As soon as a new user, it gets a value of 1... Additionally, if the last date is more than 1 day from the new date of check-in, it gets a value of 1. Then, the subsequent columns reset the @LUser and @LDate to the value of the incoming record just tested against for the next cycle. Then, the outer query just sums them up and counts them for the final correct results per the Devart data set of

    User ID    Distinct Visits   Total Days
    1           3                 9
    2           1                 2
    
    select PreQuery.User_ID,
           sum( PreQuery.NextVisit ) as DistinctVisits,
           count(*) as TotalDays
       from
          (  select v.user_id,
                   if( @LUser <> v.User_ID OR @LDate < ( date( v.dt ) - Interval 1 day ), 1, 0 ) as NextVisit,
                   @LUser := v.user_id,
                   @LDate := date( v.dt )
                from 
                   Visits v,
                   ( select @LUser := -1, @LDate := date(now()) ) AtVars 
                order by
                   v.user_id,
                   v.dt  ) PreQuery
        group by 
           PreQuery.User_ID
    
    0 讨论(0)
  • 2020-12-17 21:28

    for a first sub-task:

    select count(*) 
    from (
    select TO_DAYS(p.d)
    from p
    group by TO_DAYS(p.d)
    ) t
    
    0 讨论(0)
  • 2020-12-17 21:29

    Try to apply this code to your task -

    CREATE TABLE visits(
      user_id INT(11) NOT NULL,
      dt DATETIME DEFAULT NULL
    );
    
    INSERT INTO visits VALUES 
      (1, '2011-06-30 12:11:46'),
      (1, '2011-07-01 13:16:34'),
      (1, '2011-07-01 15:22:45'),
      (1, '2011-07-01 22:35:00'),
      (1, '2011-07-02 13:45:12'),
      (1, '2011-08-01 00:11:45'),
      (1, '2011-08-05 17:14:34'),
      (1, '2011-08-05 18:11:46'),
      (1, '2011-08-06 20:22:12'),
      (2, '2011-08-30 16:13:34'),
      (2, '2011-08-31 16:13:41');
    
    
    SET @i = 0;
    SET @last_dt = NULL;
    SET @last_user = NULL;
    
    SELECT v.user_id,
      COUNT(DISTINCT(DATE(dt))) number_of_days,
      MAX(days) number_of_visits
    FROM
      (SELECT user_id, dt
            @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS days,
            @last_dt := DATE(dt),
            @last_user := user_id
       FROM
         visits
       ORDER BY
         user_id, dt
      ) v
    GROUP BY
      v.user_id;
    
    ----------------
    Output:
    
    +---------+----------------+------------------+
    | user_id | number_of_days | number_of_visits |
    +---------+----------------+------------------+
    |       1 |              6 |                3 |
    |       2 |              2 |                1 |
    +---------+----------------+------------------+
    

    Explanation:

    To understand how it works let's check the subquery, here it is.

    SET @i = 0;
    SET @last_dt = NULL;
    SET @last_user = NULL;
    
    
    SELECT user_id, dt,
            @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS 
    
    days,
            @last_dt := DATE(dt) lt,
            @last_user := user_id lu
    FROM
      visits
    ORDER BY
      user_id, dt;
    

    As you see the query returns all rows and performs ranking for the number of visits. This is known ranking method based on variables, note that rows are ordered by user and date fields. This query calculates user visits, and outputs next data set where days column provides rank for the number of visits -

    +---------+---------------------+------+------------+----+
    | user_id | dt                  | days | lt         | lu |
    +---------+---------------------+------+------------+----+
    |       1 | 2011-06-30 12:11:46 |    1 | 2011-06-30 |  1 |
    |       1 | 2011-07-01 13:16:34 |    1 | 2011-07-01 |  1 |
    |       1 | 2011-07-01 15:22:45 |    1 | 2011-07-01 |  1 |
    |       1 | 2011-07-01 22:35:00 |    1 | 2011-07-01 |  1 |
    |       1 | 2011-07-02 13:45:12 |    1 | 2011-07-02 |  1 |
    |       1 | 2011-08-01 00:11:45 |    2 | 2011-08-01 |  1 |
    |       1 | 2011-08-05 17:14:34 |    3 | 2011-08-05 |  1 |
    |       1 | 2011-08-05 18:11:46 |    3 | 2011-08-05 |  1 |
    |       1 | 2011-08-06 20:22:12 |    3 | 2011-08-06 |  1 |
    |       2 | 2011-08-30 16:13:34 |    1 | 2011-08-30 |  2 |
    |       2 | 2011-08-31 16:13:41 |    1 | 2011-08-31 |  2 |
    +---------+---------------------+------+------------+----+
    

    Then we group this data set by user and use aggregate functions: 'COUNT(DISTINCT(DATE(dt)))' - counts the number of days 'MAX(days)' - the number of visits, it is a maximum value for the days field from our subquery.

    That is all;)

    0 讨论(0)
  • 2020-12-17 21:45

    You can find the first day of each visit by finding checkins where there was no checkin the day before.

    select count(distinct date(start_of_visit.datetime))
    from checkin start_of_visit
    left join checkin previous_day
        on start_of_visit.user = previous_day.user
        and start_of_visit.city = previous_day.city
        and date(start_of_visit.datetime) - interval 1 day = date(previous_day.datetime)
    where previous_day.id is null
    

    There are several important parts to this query.

    First, each checkin is joined to any checkin from the previous day. But since it's an outer join, if there was no checkin the previous day the right side of the join will have NULL results. The WHERE filtering happens after the join, so it keeps only those checkins from the left side where there are none from the right side. LEFT OUTER JOIN/WHERE IS NULL is really handy for finding where things aren't.

    Then it counts distinct checkin dates to make sure it doesn't double-count if the user checked in multiple times on the first day of the visit. (I actually added that part on edit, when I spotted the possible error.)

    Edit: I just re-read your proposed query for the first question. Your query would get you the number of checkins on a given date, instead of a count of dates. I think you want something like this instead:

    select count(distinct date(datetime))
    from checkin
    where user='some user' and city='some city'
    
    0 讨论(0)
  • 2020-12-17 21:45

    I think you should consider changing database structure. You could add table visits and visit_id into your checkins table. Each time you want to register new checkin you check if there is any checkin a day back. If yes then you add a new checkin with visit_id from yesterday's checkin. If not then you add new visit to visits and new checkin with new visit_id.

    Then you could get you data in one query with something like that: SELECT COUNT(id) AS number_of_days, COUNT(DISTINCT visit_id) number_of_visits FROM checkin GROUP BY user, city

    It's not very optimal but still better than doing anything with current structure and it will work. Also if results can be separate queries it will work very fast.

    But of course drawbacks are you will need to change database structure, do some more scripting and convert current data to new structure (i.e. you will need to add visit_id to current data).

    0 讨论(0)
提交回复
热议问题