Select first and last match by column from a timestamp-ordered table in MySQL

Question

Stackoverflow,

I need your help!

Say I have a table in MySQL that looks something like this:

---------------------------------------------------------------
 OWNER_ID | ENTRY_ID | VEHICLE | TIME                | LOCATION
---------------------------------------------------------------
 1        | 1        | 123456  | 2016-01-01 00:00:00 | A
 1        | 2        | 123456  | 2016-01-01 00:01:00 | B
 1        | 3        | 123456  | 2016-01-01 00:02:00 | C
 1        | 4        | 123456  | 2016-01-01 00:03:00 | C
 1        | 5        | 123456  | 2016-01-01 00:04:00 | B
 1        | 6        | 123456  | 2016-01-01 00:05:00 | A
 1        | 7        | 123456  | 2016-01-01 00:06:00 | A
 ...
 1        | 999      | 123456  | 2016-01-01 09:10:00 | A
 1        | 1000     | 123456  | 2016-01-01 09:11:00 | A
 1        | 1001     | 123456  | 2016-01-01 09:12:00 | B
 1        | 1002     | 123456  | 2016-01-01 09:13:00 | C
 1        | 1003     | 123456  | 2016-01-01 09:14:00 | C
 1        | 1004     | 123456  | 2016-01-01 09:15:00 | B
 ...

Please note that the table schema is just made up so I can explain what I'm trying to accomplish...

Imagine that from ENTRY_ID 6 through 999, the LOCATION column is "A". All I need for my application is basically rows 1-6, then row 1000 onwards. Everything from row 7 to 999 is unnecessary data that doesn't need to be processed further. What I am struggling to do is either disregard those lines without having to move the processing of the data into my application, or better yet, delete them.

I'm scratching my head with this because:

1) I can't sort by LOCATION then just take the first and last entries, because the time order is important to my application and this will become lost - for example, if I processed this data in this way, I would end up with row 1 and row 1000, losing row 6.

2) I'd prefer not to move the processing of this data to my application. The data is superfluous to my requirements, and there is simply no point keeping it if I can avoid it.

Given the above example data, what I want to end up with once I have a solution would be:

---------------------------------------------------------------
 OWNER_ID | ENTRY_ID | VEHICLE | TIME                | LOCATION
---------------------------------------------------------------
 1        | 1        | 123456  | 2016-01-01 00:00:00 | A
 1        | 2        | 123456  | 2016-01-01 00:01:00 | B
 1        | 3        | 123456  | 2016-01-01 00:02:00 | C
 1        | 4        | 123456  | 2016-01-01 00:03:00 | C
 1        | 5        | 123456  | 2016-01-01 00:04:00 | B
 1        | 6        | 123456  | 2016-01-01 00:05:00 | A
 1        | 1000     | 123456  | 2016-01-01 09:11:00 | A
 1        | 1001     | 123456  | 2016-01-01 09:12:00 | B
 1        | 1002     | 123456  | 2016-01-01 09:13:00 | C
 1        | 1003     | 123456  | 2016-01-01 09:14:00 | C
 1        | 1004     | 123456  | 2016-01-01 09:15:00 | B
 ...

Hopefully I'm making sense here and not missing something obvious!

@Aliester - Is there a way to determine that a row doesn't need to be processed from the data contained within that row?

Unfortunately not.

@O. Jones - It sounds like you're hoping to determine the earliest and latest timestamp in your table for each distinct value of ENTRY_ID, and then retrieve the detail rows from the table matching those timestamps. Is that correct? Are your ENTRY_ID values unique? Are they guaranteed to be in ascending time order? Your query can be made cheaper if that is true. Please, if you have time, edit your question to clarify these points.

I'm trying to find the arrival time at a location, followed by the departure time from that location. Yes, ENTRY_ID is a unique field, but you cannot assume that an earlier ENTRY_ID means an earlier timestamp: the incoming data is sent from a GPS unit on a vehicle and is NOT necessarily processed in the order it was sent, due to network limitations.

Answer 1:

This is a tricky problem to solve in SQL because SQL is about sets of data, not sequences of data. It's extra tricky in MySQL because other SQL dialects have a built-in way to number rows (Oracle's ROWNUM, or the ROW_NUMBER() window function) and MySQL doesn't as of late 2016.

You need the union of two sets of data here.

1) The set of rows in your table immediately before, in time, a change in location.
2) The set of rows immediately after a change in location.
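
To make the two sets concrete, here is a minimal sketch in plain Python rather than SQL. It reduces the question's sample to (ENTRY_ID, LOCATION) pairs that are assumed to be in TIME order already, and fills in the elided rows 7 through 999 with location "A" as the question describes. The variable names are made up for illustration.

```python
# Sample from the question as (ENTRY_ID, LOCATION) pairs, assumed to be in TIME order.
# Entries 6 through 1000 are all at location "A", as the question describes.
rows = [(1, "A"), (2, "B"), (3, "C"), (4, "C"), (5, "B")]
rows += [(entry_id, "A") for entry_id in range(6, 1001)]
rows += [(1001, "B"), (1002, "C"), (1003, "C"), (1004, "B")]

# Pair every row with the row that follows it in time.
consecutive = list(zip(rows, rows[1:]))

# Set 1: rows immediately before a change in location.
before_change = {cur for cur, nxt in consecutive if cur[1] != nxt[1]}
# Set 2: rows immediately after a change in location.
after_change = {nxt for cur, nxt in consecutive if cur[1] != nxt[1]}

# A row survives only if its location differs from the previous or the next row.
# Sorting by ENTRY_ID is enough here because this sample is already in time order.
kept_ids = sorted(entry_id for entry_id, _ in before_change | after_change)
print(kept_ids)  # [1, 2, 3, 4, 5, 6, 1000, 1001, 1002, 1003, 1004]
```

The result matches the output the question asks for: entries 7 through 999 sit between two rows at the same location, so they fall into neither set.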

To get that, you need to start with a subquery that generates all your rows, ordered by VEHICLE then TIME, with row numbers. (http://sqlfiddle.com/#!9/6c3bc7/2/0) Please notice that the sample data in SQL Fiddle is different from your sample data.

       -- Number every row in (VEHICLE, TIME) order using a user variable.
       SELECT (@rowa := @rowa + 1) AS rownum,
              loc.*
         FROM loc
         JOIN (SELECT @rowa := 0) init
        ORDER BY VEHICLE, TIME

Then you need to self-join that subquery, use the ON clause to exclude consecutive rows at the same location, and take the rows right before a change in location. Consecutive rows are compared with ON ... b.rownum = a.rownum + 1. This is the resulting query. (http://sqlfiddle.com/#!9/6c3bc7/1/0)

SELECT a.*
  FROM (
        SELECT (@rowa := @rowa + 1) rownum,
               loc.*
          FROM loc
          JOIN (SELECT @rowa := 0) init
         ORDER BY VEHICLE, TIME
       ) a
  JOIN (
        SELECT (@rowb := @rowb + 1) rownum,
               loc.*
          FROM loc
          JOIN (SELECT @rowb := 0) init
         ORDER BY VEHICLE, TIME
       ) b ON a.VEHICLE = b.VEHICLE
          AND b.rownum = a.rownum + 1
          AND a.location <> b.location
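
The role of each join condition may be easier to see outside SQL. The following is only a sketch in Python, using a small hypothetical data set with two vehicles (the vehicle numbers, timestamps and variable names are invented for illustration). ENTRY_ID order deliberately differs from TIME order, which the question warns can happen.

```python
from operator import itemgetter

# Hypothetical rows: (ENTRY_ID, VEHICLE, TIME, LOCATION).
loc = [
    (3, 111, "2016-01-01 00:01:00", "A"),
    (1, 111, "2016-01-01 00:00:00", "A"),
    (2, 111, "2016-01-01 00:02:00", "B"),
    (4, 222, "2016-01-01 00:00:00", "A"),
    (5, 222, "2016-01-01 00:01:00", "C"),
]

# Counterpart of the numbered subquery: ORDER BY VEHICLE, TIME, then number the rows.
ordered = sorted(loc, key=itemgetter(1, 2))
numbered = list(enumerate(ordered, start=1))
by_rownum = dict(numbered)

# Counterpart of the self-join: pair row n with row n + 1 and keep row n only
# when both rows belong to the same vehicle and their locations differ.
right_before_change = []
for rownum, a in numbered:
    b = by_rownum.get(rownum + 1)  # b.rownum = a.rownum + 1
    if b is not None and a[1] == b[1] and a[3] != b[3]:
        right_before_change.append(a)

print([row[0] for row in right_before_change])  # [3, 4]
```

Row 2 (the last row of vehicle 111, at "B") is followed by row 4 (the first row of vehicle 222, at "A"). The locations differ, but the vehicle check keeps that pair from being treated as a change in location.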

A variant of this query, where you say SELECT b.* instead, gets the rows right after a change in location. (http://sqlfiddle.com/#!9/6c3bc7/3/0)

Finally, you take the setwise UNION of those two queries, order it appropriately, and you have your set of rows with the duplicate consecutive positions removed. Please notice that this gets quite verbose in MySQL because the nasty @rowa := @rowa + 1 hack used to generate row numbers has to use a different variable (@rowa, @rowb, etc) in each copy of the subquery. (http://sqlfiddle.com/#!9/6c3bc7/4/0)

SELECT a.*
  FROM (
        SELECT (@rowa := @rowa + 1) rownum,
               loc.*
          FROM loc
          JOIN (SELECT @rowa := 0) init
         ORDER BY VEHICLE, TIME
       ) a
  JOIN (
        SELECT (@rowb := @rowb + 1) rownum,
               loc.*
          FROM loc
          JOIN (SELECT @rowb := 0) init
         ORDER BY VEHICLE, TIME
       ) b ON a.VEHICLE = b.VEHICLE
          AND b.rownum = a.rownum + 1
          AND a.location <> b.location

 UNION

SELECT d.*
  FROM (
        SELECT (@rowc := @rowc + 1) rownum,
               loc.*
          FROM loc
          JOIN (SELECT @rowc := 0) init
         ORDER BY VEHICLE, TIME
       ) c
  JOIN (
        SELECT (@rowd := @rowd + 1) rownum,
               loc.*
          FROM loc
          JOIN (SELECT @rowd := 0) init
         ORDER BY VEHICLE, TIME
       ) d ON c.VEHICLE = d.VEHICLE
          AND c.rownum = d.rownum - 1
          AND c.location <> d.location
 ORDER BY VEHICLE, TIME

And, in next-generation MySQL (available in beta now in MariaDB 10.2), this is much, much easier. The new generation has common table expressions and row numbering.

WITH numbered AS (
        -- The table loc plus a per-vehicle row number. The CTE gets its own
        -- name so that it is not confused with the table it reads from.
        SELECT ROW_NUMBER() OVER (PARTITION BY VEHICLE ORDER BY TIME) rownum,
               loc.*
          FROM loc
)
SELECT a.*
  FROM numbered a
  JOIN numbered b ON a.VEHICLE = b.VEHICLE
                 AND b.rownum = a.rownum + 1
                 AND a.location <> b.location
 UNION
SELECT b.*
  FROM numbered a
  JOIN numbered b ON a.VEHICLE = b.VEHICLE
                 AND b.rownum = a.rownum + 1
                 AND a.location <> b.location
 ORDER BY VEHICLE, TIME
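
For anyone who wants to try the CTE version without a MariaDB 10.2 or MySQL server at hand, here is a rough sketch that runs the same idea through Python's built-in sqlite3 module. SQLite is only a stand-in for the real database and needs version 3.25 or newer for window functions. The CTE is named numbered here (a made-up name) to avoid reusing the table name. The locations follow the question's description, but the timestamps are synthetic placeholders one minute apart, not the question's actual values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loc (owner_id INTEGER, entry_id INTEGER, vehicle INTEGER,"
    " time TEXT, location TEXT)"
)

# Locations as described in the question: A, B, C, C, B, then A for entries
# 6 through 1000, then B, C, C, B. Timestamps are synthetic, one minute apart.
locations = ["A", "B", "C", "C", "B"] + ["A"] * 995 + ["B", "C", "C", "B"]
rows = [
    (1, n, 123456, f"2016-01-01 {(n - 1) // 60:02d}:{(n - 1) % 60:02d}:00", where)
    for n, where in enumerate(locations, start=1)
]
conn.executemany("INSERT INTO loc VALUES (?, ?, ?, ?, ?)", rows)

query = """
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY vehicle ORDER BY time) AS rownum,
           loc.*
      FROM loc
)
SELECT a.*                      -- rows right before a change in location
  FROM numbered a
  JOIN numbered b ON a.vehicle = b.vehicle
                 AND b.rownum = a.rownum + 1
                 AND a.location <> b.location
UNION
SELECT b.*                      -- rows right after a change in location
  FROM numbered a
  JOIN numbered b ON a.vehicle = b.vehicle
                 AND b.rownum = a.rownum + 1
                 AND a.location <> b.location
ORDER BY vehicle, time
"""

# Result columns are rownum, owner_id, entry_id, vehicle, time, location.
kept_entry_ids = [row[2] for row in conn.execute(query)]
print(kept_entry_ids)  # [1, 2, 3, 4, 5, 6, 1000, 1001, 1002, 1003, 1004]
```

Entries 7 through 999 are gone, and row 6 (arrival at A) and row 1000 (departure from A) are both kept, which is the result the question asks for.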

Source: https://stackoverflow.com/questions/41166278/select-first-and-last-match-by-column-from-a-timestamp-ordered-table-in-mysql

Tags

mysql

geolocation