How to combine two rows and calculate the time difference between two timestamp values in MySQL?

误落风尘 2020-12-06 06:14

I have a situation that I'm sure is quite common, and it's really bothering me that I can't figure out how to do it or what to search for to find a relevant example/solution...

6 answers
  • 2020-12-06 06:39

    Try this.

    select start.name, start.ts start, end.ts end, timediff(end.ts, start.ts) duration from (
        select *, (
            select id from log L2 where L2.ts>L1.ts and L2.name=L1.name order by ts limit 1
        ) stop_id from log L1
    ) start join log end on end.id=start.stop_id
    where start.eventtype='start' and end.eventtype='stop';
    
  • 2020-12-06 06:42

    Can you change the data collector? If yes, add a group_id field (with an index) into the log table and write the id of the start event into it (same id for start and end in the group_id). Then you can do

    SELECT S.id, S.name, TIMEDIFF(E.ts, S.ts) `diff`
    FROM `log` S
        JOIN `log` E ON S.id = E.group_id AND E.eventtype = 'stop'
    WHERE S.eventtype = 'start';
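
    A minimal sketch of the schema change and of what the collector would write, under some assumptions that are not in the question: `log`.`id` is an AUTO_INCREMENT INT, the index name `idx_log_group_id` and the sample name 'job-a' are made up, and the closing event is stored as 'stop' (use whatever value your data uses).

```sql
-- Assumption: `log`.`id` is an AUTO_INCREMENT INT; adjust the type to match your table.
ALTER TABLE `log`
    ADD COLUMN `group_id` INT NULL,
    ADD INDEX `idx_log_group_id` (`group_id`);

-- Collector side, start event: insert the row, then point group_id at its own id.
INSERT INTO `log` (`name`, `ts`, `eventtype`)
VALUES ('job-a', NOW(), 'start');
SET @start_id = LAST_INSERT_ID();
UPDATE `log` SET `group_id` = @start_id WHERE `id` = @start_id;

-- Collector side, closing event: reuse the id of the matching start event.
INSERT INTO `log` (`name`, `ts`, `eventtype`, `group_id`)
VALUES ('job-a', NOW(), 'stop', @start_id);
```

    The collector has to remember the start id between the two inserts; the session variable here only stands in for that state.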
    
  • 2020-12-06 06:49

    If you don't mind creating a temporary table*, then I think the following should work well. I have tested it with 120,000 records, and the whole process completes in under 6 seconds. With 1,048,576 records it completed in just under 66 seconds - and that's on an old Pentium III with 128MB RAM:

    *In MySQL 5.0 (and perhaps other versions) the temporary table cannot be a true MySQL temporary table, as you cannot refer to a TEMPORARY table more than once in the same query. See here:

    http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html

    Instead, just drop/create a normal table, as follows:

    DROP TABLE IF EXISTS `tmp_log`;
    CREATE TABLE `tmp_log` (
        `id` INT NOT NULL,
        `row` INT NOT NULL,
        `name` VARCHAR(16),
        `ts` DATETIME NOT NULL,
        `eventtype` VARCHAR(25),
        INDEX `row` (`row` ASC),
        INDEX `eventtype` (`eventtype` ASC)
    );
    

    This table is used to store a sorted and numbered list of rows from the following SELECT query:

    INSERT INTO `tmp_log` (
        `id`,
        `row`,
        `name`,
        `ts`,
        `eventtype`
    )
    SELECT
        `id`,
        @row:=@row+1,
        `name`,
        `ts`,
        `eventtype`
    FROM log,
    (SELECT @row:=0) row_count
    ORDER BY `name`, `id`;
    

    The above SELECT query sorts the rows by name and then id (you could use the timestamp instead of the id, as long as the start events appear before the stop events). Each row is also numbered. By doing this, matching pairs of events are always next to each other, and the row number of the start event is always one less than the row number of the stop event.
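
    On MySQL 8.0 or later, the same numbering can probably be done with a window function instead of a user variable (assigning user variables inside expressions is deprecated there). This is only a sketch and assumes the `log` and `tmp_log` tables shown above:

```sql
-- Assumes MySQL 8.0+ and the `log` / `tmp_log` tables defined above.
INSERT INTO `tmp_log` (`id`, `row`, `name`, `ts`, `eventtype`)
SELECT
    `id`,
    -- Number the rows in (name, id) order, starting at 1.
    ROW_NUMBER() OVER (ORDER BY `name`, `id`) AS `row`,
    `name`,
    `ts`,
    `eventtype`
FROM `log`;
```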

    Now select the matching pairs from the list:

    SELECT
        start_log.row AS start_row,
        stop_log.row AS stop_row,
        start_log.name AS name,
        start_log.eventtype AS start_event,
        start_log.ts AS start_time,
        stop_log.eventtype AS stop_event,
        stop_log.ts AS end_time,
        TIMEDIFF(stop_log.ts, start_log.ts) AS duration
    FROM
        tmp_log AS start_log
    INNER JOIN tmp_log AS stop_log
        ON start_log.row+1 = stop_log.row
        AND start_log.name = stop_log.name
        AND start_log.eventtype = 'start'
        AND stop_log.eventtype = 'stop'
    ORDER BY start_log.id;
    

    Once you're done, it's probably a good idea to drop the temporary table:

    DROP TABLE IF EXISTS `tmp_log`;
    

    UPDATE

    You could try the following idea, which eliminates temp tables and joins altogether by using variables to store values from the previous row. It sorts the rows by name then time stamp, which groups all values with the same name together, and puts each group in time order. I think that this should ensure that all corresponding start/stop events are next to each other.

    SELECT id, name, start, stop, TIMEDIFF(stop, start) AS duration FROM (
        SELECT
            id, ts, eventtype,
            (@name <> name) AS new_name,
            @start AS start,
            @start := IF(eventtype = 'start', ts, NULL) AS prev_start,
            @stop  := IF(eventtype = 'stop',  ts, NULL) AS stop,
            @name  := name AS name
        FROM table1 ORDER BY name, ts
    ) AS tmp, (SELECT @start:=NULL, @stop:=NULL, @name:=NULL) AS vars
    WHERE new_name = 0 AND start IS NOT NULL AND stop IS NOT NULL;
    

    I don't know how it will compare to Ivar Bonsaksen's method, but it runs fairly fast on my box.
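
    For readers on MySQL 8.0 or later, the "look at the previous row" idea can also be sketched with LAG() rather than user variables. This is not the query I timed above; it assumes the same `table1` layout, and the aliases `prev_ts`, `prev_eventtype`, `start_time` and `stop_time` are made up for the example.

```sql
-- Assumes MySQL 8.0+ and the table1 layout used in this answer.
SELECT
    id,
    name,
    prev_ts AS start_time,
    ts      AS stop_time,
    TIMEDIFF(ts, prev_ts) AS duration
FROM (
    SELECT
        id, name, ts, eventtype,
        LAG(ts)        OVER w AS prev_ts,         -- timestamp of the previous row for this name
        LAG(eventtype) OVER w AS prev_eventtype   -- event type of the previous row for this name
    FROM table1
    WINDOW w AS (PARTITION BY name ORDER BY ts, id)
) AS t
-- Keep only stop rows whose immediate predecessor (same name) is a start row.
WHERE eventtype = 'stop'
  AND prev_eventtype = 'start';
```

    PARTITION BY name replaces the manual check for a name change, since LAG() returns NULL on the first row of each name.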

    Here's how I created the test data:

    CREATE TABLE  `table1` (
        `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
        `name` VARCHAR(5),
        `ts` DATETIME,
        `eventtype` VARCHAR(5),
        PRIMARY KEY (`id`),
        INDEX `name` (`name`),
        INDEX `ts` (`ts`)
    ) ENGINE=InnoDB;
    
    DELIMITER //
    DROP PROCEDURE IF EXISTS autofill//
    CREATE PROCEDURE autofill()
    BEGIN
        DECLARE i INT DEFAULT 0;
        WHILE i < 1000000 DO
            INSERT INTO table1 (name, ts, eventtype) VALUES (
                CHAR(FLOOR(65 + RAND() * 26)),
                DATE_ADD(NOW(), INTERVAL FLOOR(RAND() * 365) DAY),
                IF(RAND() >= 0.5, 'start', 'stop')
            );
            SET i = i + 1;
        END WHILE;
    END;
    //
    DELIMITER ;
    
    CALL autofill();
    
  • 2020-12-06 06:52

    I believe this could be a simpler way to reach your goal:

    SELECT
        start_log.name,
        MAX(start_log.ts) AS start_time,
        end_log.ts AS end_time,
        TIMEDIFF(end_log.ts, MAX(start_log.ts)) AS duration
    FROM
        log AS start_log
    INNER JOIN
        log AS end_log ON (
                start_log.name = end_log.name
            AND
                end_log.ts > start_log.ts)
    WHERE start_log.eventtype = 'start'
    AND end_log.eventtype = 'stop'
    GROUP BY start_log.name, end_log.ts;
    

    It should run considerably faster as it eliminates one subquery.

  • 2020-12-06 06:52

    I got it working by combining both your solutions, but the query isn't very efficient, and I suspect there is a smarter way to omit the unwanted rows.

    What I've got now is:

    SELECT y.name, 
           y.start, 
           y.stop, 
           TIMEDIFF(y.stop, y.start) 
      FROM (SELECT l.name, 
                   MAX(x.ts) AS start, 
                   l.ts AS stop 
              FROM log l 
              JOIN (SELECT t.name, 
                           t.ts 
                      FROM log t 
                     WHERE t.eventtype = 'start') x ON x.name = l.name 
                           AND x.ts < l.ts 
             WHERE l.eventtype = 'stop' 
          GROUP BY l.name, l.ts) y 
    WHERE NOT EXISTS (SELECT 1 
                        FROM log AS d 
                       WHERE d.ts > y.start AND d.ts < y.stop AND d.name = y.name 
                             AND d.eventtype = 'stop')

    Limited to a single name, the query goes from about 0.5 seconds to about 14 seconds when I include the WHERE NOT EXISTS clause. The table will become quite large, and I'm worried about how many hours this will take across all names. The table currently holds only 10 days of data from June 2010, and it is already at 109,888 rows.
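
    One thing that might be worth trying for the NOT EXISTS lookup is a composite index covering the columns it filters on. This is an untested guess rather than a measured fix: the index name and column order are assumptions, and the literal values in the EXPLAIN are placeholders.

```sql
-- Hypothetical index; the name and the column order are assumptions to verify.
CREATE INDEX `idx_log_name_type_ts` ON `log` (`name`, `eventtype`, `ts`);

-- Check whether the optimizer uses it for the kind of lookup the
-- NOT EXISTS subquery performs (placeholder values).
EXPLAIN
SELECT 1
FROM `log` AS d
WHERE d.name = 'some_name'
  AND d.eventtype = 'stop'
  AND d.ts > '2010-06-01 00:00:00'
  AND d.ts < '2010-06-02 00:00:00';
```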

  • 2020-12-06 06:59

    How about this:

    SELECT start_log.ts AS start_time, end_log.ts AS end_time
    FROM log AS start_log
    INNER JOIN log AS end_log ON (start_log.name = end_log.name AND end_log.ts > start_log.ts)
    WHERE NOT EXISTS (SELECT 1 FROM log
                      WHERE log.name = start_log.name
                        AND log.ts > start_log.ts AND log.ts < end_log.ts)
     AND start_log.eventtype = 'start'
     AND end_log.eventtype = 'stop';
    

    This will find each pair of rows (aliased as start_log and end_log) with no events in between, where the first is always a start and the last is always a stop. Since we disallow intermediate events, a start that's not immediately followed by a stop will naturally be excluded.
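
    To make the pairing rule concrete, here is a small worked example on made-up data. The table `log_demo` and its rows are hypothetical, and the sketch assumes the "no events in between" check is limited to events with the same name.

```sql
-- Hypothetical sample table and rows, for illustration only.
CREATE TABLE log_demo (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(16) NOT NULL,
    ts        DATETIME    NOT NULL,
    eventtype VARCHAR(25) NOT NULL
);

INSERT INTO log_demo (name, ts, eventtype) VALUES
    ('a', '2010-06-01 10:00:00', 'start'),  -- never stopped: another start follows
    ('a', '2010-06-01 10:05:00', 'start'),
    ('a', '2010-06-01 10:20:00', 'stop'),
    ('b', '2010-06-01 10:10:00', 'start'),
    ('b', '2010-06-01 10:30:00', 'stop');

SELECT s.name,
       s.ts AS start_time,
       e.ts AS end_time,
       TIMEDIFF(e.ts, s.ts) AS duration
FROM log_demo AS s
INNER JOIN log_demo AS e ON e.name = s.name AND e.ts > s.ts
WHERE s.eventtype = 'start'
  AND e.eventtype = 'stop'
  -- No event for the same name strictly between the start and the stop.
  AND NOT EXISTS (SELECT 1 FROM log_demo AS m
                  WHERE m.name = s.name
                    AND m.ts > s.ts AND m.ts < e.ts)
ORDER BY s.name, s.ts;

-- Expected rows:
--   a | 2010-06-01 10:05:00 | 2010-06-01 10:20:00 | 00:15:00
--   b | 2010-06-01 10:10:00 | 2010-06-01 10:30:00 | 00:20:00
```

    The 10:00 start for 'a' drops out because the 10:05 start sits between it and the stop. The 'b' start at 10:10 does not break the 'a' pair, since only same-name events count as intermediate.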
