Question
I have a Google BigQuery table that contains every version of each resource. Every time a resource is created, updated, or deleted, a new row is added with an incremented version number (this number is the timestamp of when the row was added).
+-------+------------+--------+-------+-------------+
|  ID   | ResourceID | Action | Count |  Timestamp  |
+-------+------------+--------+-------+-------------+
| ABC_1 | ABC        | CREATE |    10 | {timestamp} |
| ABC_2 | ABC        | UPDATE |     8 | {timestamp} |
| ABC_3 | ABC        | UPDATE |     4 | {timestamp} |
| ABC_4 | ABC        | DELETE |     4 | {timestamp} |
| -     |            |        |       |             |
| DEF_1 | DEF        | CREATE |    10 | {timestamp} |
| DEF_2 | DEF        | DELETE |    10 | {timestamp} |
| -     |            |        |       |             |
| GHJ_1 | GHJ        | CREATE |    10 | {timestamp} |
| -     |            |        |       |             |
| KLM_1 | KLM        | CREATE |    10 | {timestamp} |
| KLM_2 | KLM        | UPDATE |     5 | {timestamp} |
+-------+------------+--------+-------+-------------+
- ID: a unique ID for the row, made up of the ResourceID plus the version identifier
- ResourceID: the ID of the resource on which the action occurred
- Action: the action that occurred on the resource
- Count: the value associated with the resource
- Timestamp: the timestamp of when the row was added (the same value that is appended to the unique ID)

I need to compose a query that retrieves the latest version of each resource:
+-------+------------+--------+-------+-------------+
|  ID   | ResourceID | Action | Count |  Timestamp  |
+-------+------------+--------+-------+-------------+
| ABC_4 | ABC        | DELETE |     4 | {timestamp} |
| DEF_2 | DEF        | DELETE |    10 | {timestamp} |
| GHJ_1 | GHJ        | CREATE |    10 | {timestamp} |
| KLM_2 | KLM        | UPDATE |     5 | {timestamp} |
+-------+------------+--------+-------+-------------+
In addition, every resource whose latest version has the DELETE action needs to be ignored.
So here is the final output I'm looking for:
+-------+------------+--------+-------+-------------+
|  ID   | ResourceID | Action | Count |  Timestamp  |
+-------+------------+--------+-------+-------------+
| GHJ_1 | GHJ        | CREATE |    10 | {timestamp} |
| KLM_2 | KLM        | UPDATE |     5 | {timestamp} |
+-------+------------+--------+-------+-------------+
This is the query I wrote:
SELECT ResourceID, Count
FROM worklog_*
WHERE ID IN (
    SELECT MAX(ID)
    FROM worklog_*
    GROUP BY ResourceID
) AND Action != 'DELETE'
It is not a real BigQuery query, but it is enough to show the intended behaviour.
This query works as long as the values of the ID column can be compared, which is why I chose to build the ID from the ResourceID and the Timestamp: MAX() will then always return the latest status.
Is this the best approach? Does anyone have a suggestion for a better way to do this kind of extraction?
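For reference, the pseudo-query in the question could be written in BigQuery Standard SQL roughly as follows. This is only a sketch: the dataset name `mydataset` is a placeholder that does not appear in the question, and it assumes the ID values sort correctly as strings (for example, because the timestamp suffix has a fixed width).

```sql
#standardSQL
-- `mydataset` is a hypothetical dataset name; replace it with the real one.
SELECT ResourceID, Count
FROM `mydataset.worklog_*`
WHERE ID IN (
  -- Latest ID per resource; relies on IDs being comparable as strings
  SELECT MAX(ID)
  FROM `mydataset.worklog_*`
  GROUP BY ResourceID
)
AND Action != 'DELETE'
```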
Answer 1:
For BigQuery Standard SQL:
#standardSQL
WITH worklog AS (
  SELECT 'ABC_1' AS ID, 'ABC' AS ResourceID, 'CREATE' AS Action, 10 AS COUNT UNION ALL
  SELECT 'ABC_2', 'ABC', 'UPDATE', 8 UNION ALL
  SELECT 'ABC_3', 'ABC', 'UPDATE', 4 UNION ALL
  SELECT 'ABC_4', 'ABC', 'DELETE', 4 UNION ALL
  SELECT 'DEF_1', 'DEF', 'CREATE', 10 UNION ALL
  SELECT 'DEF_2', 'DEF', 'DELETE', 10 UNION ALL
  SELECT 'GHJ_1', 'GHJ', 'CREATE', 10 UNION ALL
  SELECT 'KLM_1', 'KLM', 'CREATE', 10 UNION ALL
  SELECT 'KLM_2', 'KLM', 'UPDATE', 5 
)
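SELECT * EXCEPT(Last)
FROM (
  SELECT *,
    -- Number each resource's versions from newest (1) to oldest
    ROW_NUMBER() OVER(PARTITION BY ResourceID ORDER BY ID DESC) AS Last
  FROM worklog
)
-- Keep only the latest version, then drop resources whose latest action is DELETE.
-- Filtering DELETE rows before ROW_NUMBER() would instead return the last
-- non-deleted version of deleted resources (ABC_3 and DEF_1 here).
WHERE Last = 1
  AND Action != 'DELETE'
-- ORDER BY ID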
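
An alternative way to express the same "latest row per group" idea is to aggregate each resource's rows into an array ordered by ID and keep only the first element. The following is a minimal sketch under the same assumption as the question (IDs sort correctly as strings); the alias `latest` and the reduced sample data are illustrative only.

```sql
#standardSQL
WITH worklog AS (
  SELECT 'ABC_1' AS ID, 'ABC' AS ResourceID, 'CREATE' AS Action, 10 AS Count UNION ALL
  SELECT 'ABC_2', 'ABC', 'DELETE', 10 UNION ALL
  SELECT 'GHJ_1', 'GHJ', 'CREATE', 10 UNION ALL
  SELECT 'KLM_1', 'KLM', 'CREATE', 10 UNION ALL
  SELECT 'KLM_2', 'KLM', 'UPDATE', 5
)
SELECT latest.*
FROM (
  -- For each resource, keep the whole row with the highest ID as a STRUCT
  SELECT ARRAY_AGG(w ORDER BY w.ID DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM worklog AS w
  GROUP BY w.ResourceID
)
-- Discard resources whose latest version is a DELETE
WHERE latest.Action != 'DELETE'
```

With this sample data the query keeps the GHJ_1 and KLM_2 rows. If the table also stores the Timestamp column described in the question, ordering by that column instead of ID would avoid depending on string comparison of the IDs.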
Source: https://stackoverflow.com/questions/45014847/google-bigquery-retrieve-last-version-of-each-row