Question
In Google BigQuery, it is possible to query a snapshot of a table as it existed at a point in the past (at least within the last 7 days).
With Legacy SQL, we can use snapshot decorators:
#legacySQL
SELECT * FROM [PROJECT_ID:DATASET.TABLE@-3600000]
With Standard SQL, we can use FOR SYSTEM_TIME AS OF in the FROM clause:
#standardSQL
SELECT *
FROM `PROJECT_ID.DATASET.TABLE`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
Both examples return a snapshot of PROJECT_ID.DATASET.TABLE as it was one hour ago.
But I'm wondering if there is any guarantee of retrieving table data in the past. A colleague told me that he read somewhere (but he can't find it anymore) that this was a "best effort" feature, so potentially there may be some missing data.
Is this feature usable in production environments for data recovery (for example, if someone inadvertently truncates an important table), as long as the recovery is done within 7 days of the mistake? Is there any guarantee that we can access all of the data stored at a particular time?
Update
As @Pentium10 correctly pointed out in a comment, recovering old data after running a CREATE OR REPLACE job on a table is not possible. After some tests, I would add that executing a job with one of these statement types:
CREATE_TABLE (CREATE OR REPLACE)
CREATE_TABLE_AS_SELECT
DROP_TABLE
completely removes the ability to retrieve earlier data for that particular table.
But suppose that we only use the following statement types to modify the table data:
INSERT
UPDATE
DELETE
MERGE
Is there a guarantee that the snapshot data at t is exactly the data contained in the table at t? Or is this a "best effort" feature?
Answer 1:
The FOR SYSTEM_TIME AS OF syntax is useful for querying a table at multiple points in time, but I recommend using the BigQuery CLI copy command with the @<time> decorator when you need to recover or roll back a table. (See the CLI example here for a complete reference.) To do this, first determine the target recovery time in epoch milliseconds. Next, run the copy command from your local machine or the Google Cloud Shell, appending the epoch time to the table name as follows:
bq cp test_data_set.weather_data@1588643412000 test_data_set.recovered_weather_data
Note that you cannot recover the table directly in place: you need to copy the snapshot to a different table name first, and then copy that table back over the original.
bq cp test_data_set.recovered_weather_data test_data_set.weather_data
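The two steps above can be scripted. The following is a minimal Python sketch, not an official tool: it converts a UTC recovery time to epoch milliseconds and builds the same two bq cp command lines shown above. The helper names (to_epoch_ms, recovery_commands) and the recovered_ prefix are illustrative assumptions, and the commands are only printed, never executed.

```python
from datetime import datetime, timezone


def to_epoch_ms(when):
    """Return epoch milliseconds for a timezone-aware datetime."""
    if when.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid local-time surprises")
    # Integer arithmetic avoids float rounding of the timestamp.
    delta = when - datetime(1970, 1, 1, tzinfo=timezone.utc)
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


def recovery_commands(dataset, table, when):
    """Build the two bq cp commands: snapshot -> temp table, temp table -> original."""
    snapshot = f"{dataset}.{table}@{to_epoch_ms(when)}"
    temp = f"{dataset}.recovered_{table}"  # hypothetical naming convention
    return [
        f"bq cp {snapshot} {temp}",
        f"bq cp {temp} {dataset}.{table}",
    ]


if __name__ == "__main__":
    target = datetime(2020, 5, 5, 1, 50, 12, tzinfo=timezone.utc)
    for command in recovery_commands("test_data_set", "weather_data", target):
        print(command)
```

For 2020-05-05 01:50:12 UTC, to_epoch_ms returns 1588643412000, which is the value used in the first bq cp command above.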
The advantage of using the BigQuery CLI over running a query with FOR SYSTEM_TIME AS OF is that it can recover or roll back a table even in case of schema changes or deletion. The copy command will work for any timestamp when the table existed, going back seven days. (The recovery window for deleted tables used to be two days, but it was extended recently.)
Regarding the recovery SLA, it's helpful to understand the architecture of BigQuery. Column data blocks are stored as objects in Colossus, Google's distributed file system, and BigQuery executes updates with a copy-on-write strategy. In practice, this means that no backup process is required: BigQuery simply keeps old column data blocks and metadata (and thus old table versions) around until they are garbage collected. BigQuery triggers the garbage collection process seven days after an update or deletion.
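To make the copy-on-write idea concrete, here is a toy Python model. It is only an illustration of the behavior described above, not BigQuery's actual implementation: the class VersionedTable, its method names, and the in-memory lists are all assumptions made for the sketch. Each write stores a new immutable version, a read "as of" a time returns the latest version committed at or before that time, and garbage collection drops versions that were superseded more than seven days ago.

```python
import bisect


class VersionedTable:
    """Toy model: every write stores a new immutable version; old ones stay until GC."""

    RETENTION_MS = 7 * 24 * 60 * 60 * 1000  # seven days, as described above

    def __init__(self):
        self._times = []     # commit times in epoch ms, ascending
        self._versions = []  # immutable row tuples, parallel to _times

    def write(self, at_ms, rows):
        # Copy on write: never mutate an earlier version in place.
        if self._times and at_ms <= self._times[-1]:
            raise ValueError("commit times must be strictly increasing")
        self._times.append(at_ms)
        self._versions.append(tuple(rows))

    def as_of(self, at_ms):
        # Latest version committed at or before at_ms.
        i = bisect.bisect_right(self._times, at_ms)
        if i == 0:
            raise LookupError("no version available at that time")
        return self._versions[i - 1]

    def collect_garbage(self, now_ms):
        # Version i-1 was superseded at _times[i]; drop it once that
        # moment is older than the retention window.
        cutoff = now_ms - self.RETENTION_MS
        keep_from = 0
        for i in range(1, len(self._times)):
            if self._times[i] <= cutoff:
                keep_from = i
        del self._times[:keep_from]
        del self._versions[:keep_from]
```

In this model, an accidental truncation is just another write of an empty version, so the rows from before it remain readable through as_of until garbage collection removes them.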
As mentioned above, rollback and recovery is advertised as a core capability of BigQuery. (See, for example, the heading Automatic backup and easy restore on the BigQuery landing page.) Thus, it appears to be subject to the standard service level agreement for other BigQuery features.
Answer 2:
BigQuery keeps snapshots of older versions of a table. Therefore, as documented, you should have no problem recovering data that was modified.
As @Pentium10 pointed out, you can't recover a deleted table once a new table with the same name has been created. For the same reason, you can't recover a table after running "CREATE OR REPLACE" on it [1].
The snapshot decorator has two limitations: the snapshot time must fall within the last 7 days, and it must be later than or equal to the table's creation time [2]. So if the table was created recently by "CREATE OR REPLACE" or "CREATE", you can't select a timestamp from before that creation.
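The two limits above can be expressed as a simple pre-check before attempting a restore. This is a minimal Python sketch under the assumptions stated in this answer (a seven-day window and a known table creation time); the function name snapshot_time_is_valid and the constant TIME_TRAVEL_WINDOW are hypothetical and are not part of any BigQuery API.

```python
from datetime import datetime, timedelta, timezone

TIME_TRAVEL_WINDOW = timedelta(days=7)  # the seven-day limit described above


def snapshot_time_is_valid(snapshot, table_created, now=None):
    """Return True if snapshot is within the window and not before table creation."""
    now = now or datetime.now(timezone.utc)
    # The earliest usable time is whichever limit is more recent.
    earliest = max(table_created, now - TIME_TRAVEL_WINDOW)
    return earliest <= snapshot <= now
```

All three arguments are expected to be timezone-aware datetimes. A table recreated yesterday therefore rejects any snapshot time from two days ago, even though that time is well inside the seven-day window.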
[1] https://cloud.google.com/bigquery/docs/managing-tables#undeletetable
[2] https://cloud.google.com/bigquery/table-decorators#snapshot-syntax
Source: https://stackoverflow.com/questions/59048115/bigquery-for-system-time-as-of-feature-guarantee-for-data-recovery