I have a situation that I'm sure is quite common, and it's really bothering me that I can't figure out how to do it or what to search for to find a relevant example/solution.
Try this.
select start.name, start.ts start, end.ts end, timediff(end.ts, start.ts) duration from (
select *, (
select id from log L2 where L2.ts>L1.ts and L2.name=L1.name order by ts limit 1
) stop_id from log L1
) start join log end on end.id=start.stop_id
where start.eventtype='start' and end.eventtype='stop';
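To make the behaviour of this query easier to check, here is a minimal sketch that runs the same shape of query against an in-memory SQLite database from Python. The sample rows, the `pair_events` helper, and the column layout `(id, name, ts, eventtype)` are assumptions for illustration; SQLite has no `TIMEDIFF`, so the duration is computed in seconds with `strftime('%s', ...)` instead.

```python
import sqlite3

# Hypothetical sample rows: (id, name, ts, eventtype).
SAMPLE_ROWS = [
    (1, "a", "2010-06-01 10:00:00", "start"),
    (2, "b", "2010-06-01 10:00:05", "start"),
    (3, "a", "2010-06-01 10:00:30", "stop"),
    (4, "a", "2010-06-01 10:01:00", "start"),  # next "a" event is another start
    (5, "a", "2010-06-01 10:02:00", "start"),
    (6, "b", "2010-06-01 10:02:05", "stop"),
    (7, "a", "2010-06-01 10:02:45", "stop"),
]

# For every row, look up the id of the next event with the same name,
# then keep only the rows where a 'start' is directly followed by a 'stop'.
PAIR_QUERY = """
SELECT s.name, s.ts AS start_ts, e.ts AS stop_ts,
       CAST(strftime('%s', e.ts) AS INTEGER)
         - CAST(strftime('%s', s.ts) AS INTEGER) AS duration_seconds
FROM (
    SELECT l1.*, (
        SELECT l2.id FROM log l2
        WHERE l2.ts > l1.ts AND l2.name = l1.name
        ORDER BY l2.ts LIMIT 1
    ) AS stop_id
    FROM log l1
) s
JOIN log e ON e.id = s.stop_id
WHERE s.eventtype = 'start' AND e.eventtype = 'stop'
ORDER BY s.id
"""


def pair_events(rows):
    """Load rows into an in-memory log table and return the matched pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE log (id INTEGER PRIMARY KEY, name TEXT, ts TEXT, eventtype TEXT)"
    )
    conn.executemany("INSERT INTO log VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(PAIR_QUERY).fetchall()
    conn.close()
    return result
```

With the sample rows, `pair_events(SAMPLE_ROWS)` returns three pairs with durations of 30, 120 and 45 seconds; the start with id 4 is dropped because the next event for that name is another start.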
Can you change the data collector? If yes, add a group_id field (with an index) into the log table and write the id of the start event into it (same id for start and end in the group_id). Then you can do
SELECT S.id, S.name, TIMEDIFF(E.ts, S.ts) `diff`
FROM `log` S
JOIN `log` E ON S.id = E.group_id AND E.eventtype = 'stop'
WHERE S.eventtype = 'start'
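As a rough illustration of what the data collector would have to do, here is a sketch using Python and an in-memory SQLite database. The helper names (`open_log`, `log_start`, `log_stop`, `durations`) are hypothetical, the closing event is assumed to be stored with eventtype 'stop', and the duration is computed in seconds because SQLite has no `TIMEDIFF`.

```python
import sqlite3


def open_log():
    """Create an in-memory log table with an indexed group_id column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE log (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               name TEXT, ts TEXT, eventtype TEXT, group_id INTEGER)"""
    )
    conn.execute("CREATE INDEX log_group_id ON log (group_id)")
    return conn


def log_start(conn, name, ts):
    """Insert a start event and store its own id as its group_id."""
    cur = conn.execute(
        "INSERT INTO log (name, ts, eventtype) VALUES (?, ?, 'start')", (name, ts)
    )
    start_id = cur.lastrowid
    conn.execute("UPDATE log SET group_id = ? WHERE id = ?", (start_id, start_id))
    return start_id


def log_stop(conn, name, ts, start_id):
    """Insert a stop event that carries the id of its start event."""
    conn.execute(
        "INSERT INTO log (name, ts, eventtype, group_id) VALUES (?, ?, 'stop', ?)",
        (name, ts, start_id),
    )


def durations(conn):
    """Join each start event to its stop event through group_id."""
    return conn.execute(
        """SELECT s.id, s.name,
                  CAST(strftime('%s', e.ts) AS INTEGER)
                    - CAST(strftime('%s', s.ts) AS INTEGER)
           FROM log s
           JOIN log e ON s.id = e.group_id AND e.eventtype = 'stop'
           WHERE s.eventtype = 'start'
           ORDER BY s.id"""
    ).fetchall()
```

The collector has to remember the value returned by `log_start` until the matching stop arrives; a start that never gets a stop simply produces no row in `durations`.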
If you don't mind creating a temporary table*, then I think the following should work well. I have tested it with 120,000 records, and the whole process completes in under 6 seconds. With 1,048,576 records it completed in just under 66 seconds - and that's on an old Pentium III with 128MB RAM:
*In MySQL 5.0 (and perhaps other versions) the temporary table cannot be a true MySQL temporary table, as you cannot refer to a TEMPORARY table more than once in the same query. See here:
http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html
Instead, just drop/create a normal table, as follows:
DROP TABLE IF EXISTS `tmp_log`;
CREATE TABLE `tmp_log` (
`id` INT NOT NULL,
`row` INT NOT NULL,
`name` VARCHAR(16),
`ts` DATETIME NOT NULL,
`eventtype` VARCHAR(25),
INDEX `row` (`row` ASC),
INDEX `eventtype` (`eventtype` ASC)
);
This table is used to store a sorted and numbered list of rows from the following SELECT query:
INSERT INTO `tmp_log` (
`id`,
`row`,
`name`,
`ts`,
`eventtype`
)
SELECT
`id`,
@row:=@row+1,
`name`,
`ts`,
`eventtype`
FROM log,
(SELECT @row:=0) row_count
ORDER BY `name`, `id`;
The above SELECT query sorts the rows by name and then id (you could use the timestamp instead of the id, just so long as the start events appear before the stop events). Each row is also numbered. By doing this, matching pairs of events are always next to each other, and the row number of the start event is always one less than the row number of the stop event.
Now select the matching pairs from the list:
SELECT
start_log.row AS start_row,
stop_log.row AS stop_row,
start_log.name AS name,
start_log.eventtype AS start_event,
start_log.ts AS start_time,
stop_log.eventtype AS stop_event,
stop_log.ts AS end_time,
TIMEDIFF(stop_log.ts, start_log.ts) AS duration
FROM
tmp_log AS start_log
INNER JOIN tmp_log AS stop_log
ON start_log.row+1 = stop_log.row
AND start_log.name = stop_log.name
AND start_log.eventtype = 'start'
AND stop_log.eventtype = 'stop'
ORDER BY start_log.id;
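The numbering-and-self-join idea can be tried out on a small scale with the following sketch in Python and SQLite. It is only an approximation of the MySQL version: the row numbers are assigned in Python with `enumerate` rather than with the `@row` variable, the column is called `row_num` (a hypothetical name chosen to stay clear of keywords), and durations are in seconds.

```python
import sqlite3


def pair_by_row_number(rows):
    """rows: (id, name, ts, eventtype) tuples. Returns matched start/stop pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tmp_log "
        "(id INTEGER, row_num INTEGER, name TEXT, ts TEXT, eventtype TEXT)"
    )
    conn.execute("CREATE INDEX tmp_log_row_num ON tmp_log (row_num)")

    # Sort by name, then id, and number the rows starting at 1.
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    conn.executemany(
        "INSERT INTO tmp_log (id, row_num, name, ts, eventtype) VALUES (?, ?, ?, ?, ?)",
        [(r[0], n, r[1], r[2], r[3]) for n, r in enumerate(ordered, start=1)],
    )

    # A pair is a 'start' row immediately followed by a 'stop' row of the same name.
    pairs = conn.execute(
        """SELECT s.row_num, e.row_num, s.name, s.ts, e.ts,
                  CAST(strftime('%s', e.ts) AS INTEGER)
                    - CAST(strftime('%s', s.ts) AS INTEGER)
           FROM tmp_log s
           JOIN tmp_log e
             ON s.row_num + 1 = e.row_num
            AND s.name = e.name
            AND s.eventtype = 'start'
            AND e.eventtype = 'stop'
           ORDER BY s.id"""
    ).fetchall()
    conn.close()
    return pairs
```

Because the join only ever looks at row `n` and row `n + 1`, its cost grows with the number of rows rather than with the number of start/stop combinations per name.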
Once you're done, it's probably a good idea to drop the temporary table:
DROP TABLE IF EXISTS `tmp_log`;
UPDATE
You could try the following idea, which eliminates temp tables and joins altogether by using variables to store values from the previous row. It sorts the rows by name then time stamp, which groups all values with the same name together, and puts each group in time order. I think that this should ensure that all corresponding start/stop events are next to each other.
SELECT id, name, start, stop, TIMEDIFF(stop, start) AS duration FROM (
SELECT
id, ts, eventtype,
(@name <> name) AS new_name,
@start AS start,
@start := IF(eventtype = 'start', ts, NULL) AS prev_start,
@stop := IF(eventtype = 'stop', ts, NULL) AS stop,
@name := name AS name
FROM table1 ORDER BY name, ts
) AS tmp, (SELECT @start:=NULL, @stop:=NULL, @name:=NULL) AS vars
WHERE new_name = 0 AND start IS NOT NULL AND stop IS NOT NULL;
I don't know how it will compare to Ivar Bonsaksen's method, but it runs fairly fast on my box.
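The logic of the variable-based query may be easier to follow as a plain single pass over the sorted rows. This is a sketch in Python, not a translation MySQL guarantees to evaluate identically; `prev_name` and `prev_start` stand in for `@name` and `@start`, and the function name is hypothetical.

```python
from datetime import datetime


def _parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp string."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


def durations_single_pass(rows):
    """rows: (id, name, ts, eventtype) tuples.

    Yields (id_of_stop_row, name, start_ts, stop_ts, duration) for every stop
    row whose immediately preceding row (same name) was a start.
    """
    prev_name = None   # name of the previous row, like @name
    prev_start = None  # ts of the previous row if it was a start, like @start

    # Sort by name, then timestamp, so each name's events are contiguous.
    for row_id, name, ts, eventtype in sorted(rows, key=lambda r: (r[1], r[2])):
        same_name = name == prev_name
        if same_name and prev_start is not None and eventtype == "stop":
            yield row_id, name, prev_start, ts, _parse(ts) - _parse(prev_start)
        # Remember this row's state for the next iteration.
        prev_start = ts if eventtype == "start" else None
        prev_name = name
```

Only the previous row's state is kept, so no join is needed; the sort on name and timestamp does the grouping.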
Here's how I created the test data:
CREATE TABLE `table1` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`name` VARCHAR(5),
`ts` DATETIME,
`eventtype` VARCHAR(5),
PRIMARY KEY (`id`),
INDEX `name` (`name`),
INDEX `ts` (`ts`)
) ENGINE=InnoDB;
DELIMITER //
DROP PROCEDURE IF EXISTS autofill//
CREATE PROCEDURE autofill()
BEGIN
DECLARE i INT DEFAULT 0;
WHILE i < 1000000 DO
INSERT INTO table1 (name, ts, eventtype) VALUES (
CHAR(FLOOR(65 + RAND() * 26)),
DATE_ADD(NOW(),
INTERVAL FLOOR(RAND() * 365) DAY),
IF(RAND() >= 0.5, 'start', 'stop')
);
SET i = i + 1;
END WHILE;
END;
//
DELIMITER ;
CALL autofill();
I believe this could be a simpler way to reach your goal:
SELECT
start_log.name,
MAX(start_log.ts) AS start_time,
end_log.ts AS end_time,
TIMEDIFF(end_log.ts, MAX(start_log.ts)) AS duration
FROM
log AS start_log
INNER JOIN
log AS end_log ON (
start_log.name = end_log.name
AND
end_log.ts > start_log.ts)
WHERE start_log.eventtype = 'start'
AND end_log.eventtype = 'stop'
GROUP BY start_log.name, end_log.ts
It should run considerably faster as it eliminates one subquery.
I got it working by combining both your solutions, but the query isn't very efficient, and I'd think there would be a smarter way to omit those unwanted rows.
What I've got now is:
SELECT y.name,
y.start,
y.stop,
TIMEDIFF(y.stop, y.start)
FROM (SELECT l.name,
MAX(x.ts) AS start,
l.ts AS stop
FROM log l
JOIN (SELECT t.name,
t.ts
FROM log t
WHERE t.eventtype = 'start') x ON x.name = l.name
AND x.ts < l.ts
WHERE l.eventtype = 'stop'
GROUP BY l.name, l.ts) y
WHERE NOT EXISTS (SELECT 1
FROM log AS d
WHERE d.ts > y.start AND d.ts < y.stop AND d.name = y.name
AND d.eventtype = 'stop')
Limited to a given 'name', the query goes from about 0.5 seconds to about 14 seconds when I include the WHERE NOT EXISTS clause... The table will become quite large and I'm worried about how many hours this will take for all names in the end. I currently only have data for June 2010 in the table (10 days) and it's now at 109888 rows.
How about this:
SELECT start_log.ts AS start_time, end_log.ts AS end_time
FROM log AS start_log
INNER JOIN log AS end_log ON (start_log.name = end_log.name AND end_log.ts > start_log.ts)
WHERE NOT EXISTS (SELECT 1 FROM log WHERE log.name = start_log.name AND log.ts > start_log.ts AND log.ts < end_log.ts)
AND start_log.eventtype = 'start'
AND end_log.eventtype = 'stop'
This will find each pair of rows (aliased as start_log and end_log) with no events in between, where the first is always a start and the last is always a stop. Since we disallow intermediate events, a start that's not immediately followed by a stop will naturally be excluded.
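One detail worth checking with this approach is which rows count as "in between" when several names are logged at the same time. The sketch below, in Python with an in-memory SQLite database, runs the query in two variants; the `find_pairs` helper, the `mid` alias, and the `same_name_only` flag are hypothetical and only serve to toggle whether the NOT EXISTS check is limited to events with the same name as the pair.

```python
import sqlite3


def find_pairs(rows, same_name_only=True):
    """rows: (id, name, ts, eventtype) tuples. Returns (name, start_ts, stop_ts)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE log (id INTEGER PRIMARY KEY, name TEXT, ts TEXT, eventtype TEXT)"
    )
    conn.executemany("INSERT INTO log VALUES (?, ?, ?, ?)", rows)

    # Optionally restrict the "events in between" check to the pair's own name.
    name_filter = "AND mid.name = start_log.name" if same_name_only else ""
    query = f"""
        SELECT start_log.name, start_log.ts, end_log.ts
        FROM log AS start_log
        INNER JOIN log AS end_log
            ON start_log.name = end_log.name AND end_log.ts > start_log.ts
        WHERE NOT EXISTS (
            SELECT 1 FROM log AS mid
            WHERE mid.ts > start_log.ts AND mid.ts < end_log.ts {name_filter}
        )
        AND start_log.eventtype = 'start'
        AND end_log.eventtype = 'stop'
        ORDER BY start_log.ts
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

With interleaved events for two names, the name-restricted variant still finds every start that is directly followed by a stop for the same name, while the unrestricted variant can reject a pair merely because an unrelated name logged something in the meantime.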