I have a situation that I'm sure is quite common and it's really bothering me that I can't figure out how to do it or what to search for to find a relevant example/solution.
If you don't mind creating a temporary table*, then I think the following should work well. I have tested it with 120,000 records, and the whole process completes in under 6 seconds. With 1,048,576 records it completed in just under 66 seconds - and that's on an old Pentium III with 128MB RAM:
*In MySQL 5.0 (and perhaps other versions) the temporary table cannot be a true MySQL temporary table, as you cannot refer to a TEMPORARY table more than once in the same query. See here:
http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html
Instead, just drop/create a normal table, as follows:
DROP TABLE IF EXISTS `tmp_log`;
CREATE TABLE `tmp_log` (
    `id` INT NOT NULL,
    `row` INT NOT NULL,
    `name` VARCHAR(16),
    `ts` DATETIME NOT NULL,
    `eventtype` VARCHAR(25),
    INDEX `row` (`row` ASC),
    INDEX `eventtype` (`eventtype` ASC)
);
This table is used to store a sorted and numbered list of rows from the following SELECT query:
INSERT INTO `tmp_log` (
    `id`,
    `row`,
    `name`,
    `ts`,
    `eventtype`
)
SELECT
    `id`,
    @row:=@row+1,
    `name`,
    `ts`,
    `eventtype`
FROM log,
    (SELECT @row:=0) row_count
ORDER BY `name`, `id`;
The SELECT above sorts the rows by name and then by id (you could sort by the timestamp instead of the id, as long as each start event appears before its stop event) and numbers each row in that order. As a result, matching events are always next to each other, and the row number of a start event is always one less than the row number of its stop event.
Now select the matching pairs from the list:
SELECT
    start_log.row AS start_row,
    stop_log.row AS stop_row,
    start_log.name AS name,
    start_log.eventtype AS start_event,
    start_log.ts AS start_time,
    stop_log.eventtype AS stop_event,
    stop_log.ts AS end_time,
    TIMEDIFF(stop_log.ts, start_log.ts) AS duration
FROM
    tmp_log AS start_log
    INNER JOIN tmp_log AS stop_log
        ON start_log.row+1 = stop_log.row
        AND start_log.name = stop_log.name
        AND start_log.eventtype = 'start'
        AND stop_log.eventtype = 'stop'
ORDER BY start_log.id;
Once you're done, it's probably a good idea to drop the temporary table:
DROP TABLE IF EXISTS `tmp_log`;
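To make the pairing rule concrete outside the database, here is a minimal Python sketch of the same idea: sort by name and id, then treat two adjacent rows as a pair when they share a name and go from 'start' to 'stop'. The sample rows and the `pair_events` helper are made up for illustration; they are not part of the schema above, and this is only a sketch of the logic, not a replacement for the SQL.

```python
from datetime import datetime

# Hypothetical sample rows: (id, name, ts, eventtype), mirroring the log table's columns.
sample_log = [
    (1, "a", datetime(2010, 1, 1, 10, 0, 0), "start"),
    (2, "b", datetime(2010, 1, 1, 10, 5, 0), "start"),
    (3, "a", datetime(2010, 1, 1, 10, 30, 0), "stop"),
    (4, "b", datetime(2010, 1, 1, 11, 0, 0), "stop"),
    (5, "a", datetime(2010, 1, 1, 12, 0, 0), "start"),  # no matching stop
]


def pair_events(rows):
    """Pair each 'start' row with the 'stop' row directly after it for the same name."""
    # Same ordering as the INSERT ... SELECT: by name, then by id.
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    pairs = []
    # Walking adjacent rows plays the role of the join on start_log.row+1 = stop_log.row.
    for first, second in zip(ordered, ordered[1:]):
        same_name = first[1] == second[1]
        if same_name and first[3] == "start" and second[3] == "stop":
            # (start id, name, start time, end time, duration)
            pairs.append((first[0], first[1], first[2], second[2], second[2] - first[2]))
    # The SQL version orders its result by the start row's id.
    pairs.sort(key=lambda p: p[0])
    return pairs


pairs = pair_events(sample_log)
for start_id, name, start_time, end_time, duration in pairs:
    print(start_id, name, start_time, end_time, duration)
```

The unmatched start for "a" (id 5) produces no output row, in the same way that the INNER JOIN drops it.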
UPDATE
You could try the following idea, which eliminates temporary tables and joins altogether by using variables to store values from the previous row. It sorts the rows by name and then by timestamp, which groups all rows with the same name together and puts each group in time order. I think this should ensure that corresponding start/stop events are next to each other.
SELECT id, name, start, stop, TIMEDIFF(stop, start) AS duration FROM (
    SELECT
        id, ts, eventtype,
        (@name <> name) AS new_name,  -- 1 when the name differs from the previous row
        @start AS start,              -- start time carried over from the previous row, if any
        @start := IF(eventtype = 'start', ts, NULL) AS prev_start,
        @stop := IF(eventtype = 'stop', ts, NULL) AS stop,
        @name := name AS name
    FROM table1 ORDER BY name, ts
) AS tmp, (SELECT @start:=NULL, @stop:=NULL, @name:=NULL) AS vars
WHERE new_name = 0 AND start IS NOT NULL AND stop IS NOT NULL;
I don't know how it will compare to Ivar Bonsaksen's method, but it runs fairly fast on my box.
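If the variable juggling in that query is hard to follow, the same previous-row logic can be sketched in Python. This is only an illustration under my own assumptions: the sample rows and the `pair_with_previous_row` helper are hypothetical, with `prev_name` and `prev_start` standing in for `@name` and `@start`.

```python
from datetime import datetime

# Hypothetical sample rows: (id, name, ts, eventtype), mirroring table1's columns.
sample_rows = [
    (1, "A", datetime(2010, 1, 1, 9, 0, 0), "stop"),    # stray stop, no start before it
    (2, "A", datetime(2010, 1, 1, 10, 0, 0), "start"),
    (3, "A", datetime(2010, 1, 1, 10, 45, 0), "stop"),
    (4, "B", datetime(2010, 1, 1, 8, 0, 0), "start"),
    (5, "B", datetime(2010, 1, 1, 8, 20, 0), "stop"),
    (6, "A", datetime(2010, 1, 1, 11, 0, 0), "start"),  # unmatched start
]


def pair_with_previous_row(rows):
    """Emit a row whenever a 'stop' directly follows a 'start' for the same name."""
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))  # ORDER BY name, ts
    prev_name = None   # stands in for @name
    prev_start = None  # stands in for @start
    matched = []
    for row_id, name, ts, eventtype in ordered:
        same_name = prev_name == name                 # new_name = 0 in the query
        stop = ts if eventtype == "stop" else None
        if same_name and prev_start is not None and stop is not None:
            # Like the query, report the id of the stop row.
            matched.append((row_id, name, prev_start, stop, stop - prev_start))
        # Remember this row's values for the next iteration.
        prev_start = ts if eventtype == "start" else None
        prev_name = name
    return matched


matched = pair_with_previous_row(sample_rows)
for row_id, name, start, stop, duration in matched:
    print(row_id, name, start, stop, duration)
```

The stray stop (id 1) and the unmatched start (id 6) are both skipped, because a row is only reported when the row immediately before it is a start with the same name.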
Here's how I created the test data:
CREATE TABLE `table1` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `name` VARCHAR(5),
    `ts` DATETIME,
    `eventtype` VARCHAR(5),
    PRIMARY KEY (`id`),
    INDEX `name` (`name`),
    INDEX `ts` (`ts`)
) ENGINE=InnoDB;
DELIMITER //
DROP PROCEDURE IF EXISTS autofill//
CREATE PROCEDURE autofill()
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < 1000000 DO
        INSERT INTO table1 (name, ts, eventtype) VALUES (
            CHAR(FLOOR(65 + RAND() * 26)),
            DATE_ADD(NOW(),
                INTERVAL FLOOR(RAND() * 365) DAY),
            IF(RAND() >= 0.5, 'start', 'stop')
        );
        SET i = i + 1;
    END WHILE;
END;
//
DELIMITER ;
CALL autofill();