Question
I have a very large table where each row represents an abstraction called a Trip. Trips consist of numeric columns such as vehicle id, trip id, start time, stop time, distance traveled, driving duration, etc. So each Trip is a 1D vector of floating point values.
I want to transform this table, or list of vectors, into a list of Trip sequences, where Trips are grouped by vehicle id and ordered by start time. The sequence length needs to be capped at a specific size, such as 256, so there can (and should) be multiple sequences with the same vehicle id.
Example:
(sequence length = 4)
[
(Vehicle1, [Trip1, Trip2, Trip3, Trip4]),
(Vehicle1, [Trip5, Trip6, Trip7]),
(Vehicle2, [Trip1, Trip2, Trip3, Trip4])
]
I'm trying to model driving patterns based on these Trips using a sequence-based model such as an LSTM / Transformer. Imagine each Trip as a word embedding and each sequence of trips as a sentence. Somehow I need to construct these sentences through a combination of BigQuery / Apache Beam functions (or any other recommended tools) since we're talking about hundreds of gigabytes of data. I'm fairly new to both tools so any help would be greatly appreciated.
Answer 1:
The query below is for BigQuery Standard SQL:
#standardSQL
SELECT trip.vehicle_id, ARRAY_AGG(trip ORDER BY trip.start_time) trips
FROM (
  -- grp numbers each vehicle's trips in chunks of 4 (the sequence length)
  SELECT trip, DIV(ROW_NUMBER() OVER(PARTITION BY vehicle_id ORDER BY start_time) - 1, 4) grp
  FROM `project.dataset.table` trip
)
GROUP BY trip.vehicle_id, grp
The query above orders each vehicle's trips by start_time and uses a sequence length of 4; change the second argument of DIV to use a different length, such as 256.
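To see how the grp expression splits a vehicle's trips, the same integer arithmetic can be reproduced outside BigQuery. This is a minimal Python sketch, assuming 1-based row numbers as produced by ROW_NUMBER(); the function name chunk_index is made up for illustration and is not part of the answer.

```python
def chunk_index(row_number: int, seq_len: int = 4) -> int:
    """Mirror DIV(ROW_NUMBER() - 1, seq_len) for a 1-based row number."""
    # For non-negative operands, Python's floor division matches
    # the truncating integer division that DIV performs.
    return (row_number - 1) // seq_len


# Seven trips of one vehicle, numbered 1..7 in start_time order.
groups = [chunk_index(n) for n in range(1, 8)]
print(groups)  # [0, 0, 0, 0, 1, 1, 1]
```

Rows that share a chunk index end up in the same array, which is why the seven trips of Vehicle1 produce one sequence of four and one of three.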
The query also returns vehicle_id as part of each trip struct inside the array, as in the example output below:
Row  vehicle_id  trips.vehicle_id  trips.trip_id  trips.start_time  trips.stop_time
1    Vehicle1    Vehicle1          Trip1          1                 2
                 Vehicle1          Trip2          2                 3
                 Vehicle1          Trip3          3                 4
                 Vehicle1          Trip4          4                 5
2    Vehicle1    Vehicle1          Trip5          5                 6
                 Vehicle1          Trip6          6                 6
                 Vehicle1          Trip7          7                 6
3    Vehicle2    Vehicle2          Trip1          2                 3
                 Vehicle2          Trip2          3                 4
                 Vehicle2          Trip3          4                 5
                 Vehicle2          Trip4          5                 6
To remove the duplicated vehicle_id from the array elements, try the query below:
#standardSQL
SELECT vehicle_id,
  ARRAY(
    SELECT AS STRUCT * EXCEPT(vehicle_id)
    FROM UNNEST(trips)
    ORDER BY start_time
  ) trips
FROM (
  SELECT trip.vehicle_id, ARRAY_AGG(trip ORDER BY trip.start_time) trips
  FROM (
    SELECT trip, DIV(ROW_NUMBER() OVER(PARTITION BY vehicle_id ORDER BY start_time) - 1, 4) grp
    FROM `project.dataset.table` trip
  )
  GROUP BY trip.vehicle_id, grp
)
Row  vehicle_id  trips.trip_id  trips.start_time  trips.stop_time
1    Vehicle1    Trip1          1                 2
                 Trip2          2                 3
                 Trip3          3                 4
                 Trip4          4                 5
2    Vehicle1    Trip5          5                 6
                 Trip6          6                 6
                 Trip7          7                 6
3    Vehicle2    Trip1          2                 3
                 Trip2          3                 4
                 Trip3          4                 5
                 Trip4          5                 6
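For checking the query's result on a small sample, the same transformation can be mirrored in plain Python. This is only an in-memory sketch and not a replacement for the query at hundreds of gigabytes. The function name build_sequences and the dictionary layout of a trip are assumptions made for illustration; the sample rows are copied from the example output above.

```python
from itertools import groupby
from operator import itemgetter


def build_sequences(trips, seq_len=4):
    """Group trips by vehicle_id, order by start_time, split into chunks."""
    ordered = sorted(trips, key=itemgetter("vehicle_id", "start_time"))
    sequences = []
    for vehicle_id, rows in groupby(ordered, key=itemgetter("vehicle_id")):
        # Drop vehicle_id from each trip, like SELECT AS STRUCT * EXCEPT(vehicle_id).
        rows = [{k: v for k, v in r.items() if k != "vehicle_id"} for r in rows]
        # Slice the ordered trips into consecutive chunks of at most seq_len.
        for i in range(0, len(rows), seq_len):
            sequences.append((vehicle_id, rows[i:i + seq_len]))
    return sequences


# (vehicle_id, trip_id, start_time, stop_time), deliberately listed out of order.
raw = [
    ("Vehicle2", "Trip4", 5, 6), ("Vehicle2", "Trip3", 4, 5),
    ("Vehicle2", "Trip2", 3, 4), ("Vehicle2", "Trip1", 2, 3),
    ("Vehicle1", "Trip7", 7, 6), ("Vehicle1", "Trip6", 6, 6),
    ("Vehicle1", "Trip5", 5, 6), ("Vehicle1", "Trip4", 4, 5),
    ("Vehicle1", "Trip3", 3, 4), ("Vehicle1", "Trip2", 2, 3),
    ("Vehicle1", "Trip1", 1, 2),
]
columns = ("vehicle_id", "trip_id", "start_time", "stop_time")
trips = [dict(zip(columns, row)) for row in raw]

result = build_sequences(trips)
summary = [(vehicle, [t["trip_id"] for t in seq]) for vehicle, seq in result]
for vehicle, trip_ids in summary:
    print(vehicle, trip_ids)
```

The printed sequences should match the three rows of the query output: two for Vehicle1 (four trips, then three) and one for Vehicle2. The same per-vehicle logic could in principle run after a group-by-key step in Apache Beam, but that is an assumption here and not something the answer covers.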
Source: https://stackoverflow.com/questions/58699663/how-to-transform-an-sql-table-into-a-list-of-row-sequences-using-bigquery-and-ap