Question
In this blog a Google Cloud employee explains how to do repeatable sampling of data sets for machine learning in BigQuery. This is very important for creating (and replicating) train/validation/test partitions of your data.
However, the blog uses Legacy SQL, which Google has now deprecated in favor of Standard SQL.
How would you re-write the blog's sampling code shown below, but using Standard SQL?
#legacySQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
[bigquery-samples:airline_ontime_data.flights]
WHERE
ABS(HASH(date)) % 10 < 8
Answer 1:
Standard SQL would re-write the query thus:
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8
Specifically, here are the changes:
- a period (not colon) to separate the Google Cloud project from table name.
- backticks (not square brackets) to escape hyphen in the table name.
- MOD function (not %).
- FARM_FINGERPRINT (not HASH). This is actually a different hashing function than Legacy SQL's HASH, which wasn't in fact consistent over time as the blog had implied.
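For example (my own sketch, not part of the answer above), the same buckets can be reused to carve out all three repeatable splits in one query; the 80/10/10 proportions and the split label are assumptions:
#standardSQL
SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay,
  -- buckets 0-7 -> train (~80%), bucket 8 -> validation (~10%), bucket 9 -> test (~10%)
  CASE
    WHEN ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8 THEN 'train'
    WHEN ABS(MOD(FARM_FINGERPRINT(date), 10)) = 8 THEN 'validation'
    ELSE 'test'
  END AS split
FROM
  `bigquery-samples.airline_ontime_data.flights`
Because every row with the same date hashes to the same bucket, a given date always lands in the same split, which is what makes the sampling repeatable.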
Answer 2:
Building on the accepted answer, here is a more general way to generate a unique key for each row:
TO_JSON_STRING(STRUCT(col1, col2, ..., colN))
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(STRUCT(date, airline, arrival_delay))), 10)) < 8
What if there is no unique key to identify each row?
Yes, it can happen that your dataset contains duplicated rows by design; with the above query, either all or none of the duplicates end up in the sample set.
Depending on how big your dataset is, you can order the source data and use a window function to generate a row_number for each row, then sample based on row_number. This trick works until sorting your dataset hits this error:
Resources exceeded during query execution: The query could not be executed in the allotted memory.
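A minimal sketch of that window-function idea (mine, not from the original answer), assuming the listed columns give a deterministic ordering; each duplicate gets its own row number and is therefore sampled independently:
#standardSQL
SELECT
  date,
  airline,
  arrival_delay
FROM (
  SELECT
    date,
    airline,
    arrival_delay,
    -- the ORDER BY over the whole table is the part that can exhaust memory
    ROW_NUMBER() OVER (ORDER BY date, airline, arrival_delay) AS rn
  FROM
    `bigquery-samples.airline_ontime_data.flights`
)
WHERE
  ABS(MOD(FARM_FINGERPRINT(CAST(rn AS STRING)), 10)) < 8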
What if I do hit the above error?
Well, the above approach is simpler to implement, but if you hit the limit, consider something more complex:
- Generate a deduplicated table with a COUNT of how many times each row appears in the original dataset.
- After hashing the row, increase the odds that it is picked based on its COUNT.
- Since you don't want to include all COUNT duplicates, do a hash again to decide how big a portion of the duplicates should go into the sample set. (There must be some better way mathematically, though; see the sketch below for one reading.)
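A rough sketch of one possible reading of these steps (mine, not the answer's); the 80% rate, the column list, and the rounding rule are all assumptions:
#standardSQL
WITH deduped AS (
  SELECT
    date,
    airline,
    arrival_delay,
    COUNT(*) AS cnt  -- how many times this row appears in the original data
  FROM
    `bigquery-samples.airline_ontime_data.flights`
  GROUP BY
    date, airline, arrival_delay
)
SELECT
  date,
  airline,
  arrival_delay,
  -- keep roughly 80% of each group's cnt duplicates; a second, salted hash
  -- decides whether the fractional remainder rounds up or down
  CAST(FLOOR(cnt * 0.8) AS INT64)
    + IF(MOD(ABS(FARM_FINGERPRINT(CONCAT(
           TO_JSON_STRING(STRUCT(date, airline, arrival_delay)), '-frac'))), 100) / 100
         < cnt * 0.8 - FLOOR(cnt * 0.8), 1, 0) AS sampled_cnt
FROM
  deduped
Re-expanding sampled_cnt copies of each row is left out to keep the sketch short.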
Answer 3:
Even less verbose versions that I use in practice instead of the lengthy TO_JSON_STRING(STRUCT(col1, col2, ..., colN)):
TO_JSON_STRING(t)
and
FORMAT('%t', t)
as in the examples below:
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights` t
WHERE
MOD(ABS(FARM_FINGERPRINT(FORMAT('%t', t))), 10) < 8
and
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights` t
WHERE
MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(t))), 10) < 8
Source: https://stackoverflow.com/questions/46019624/how-to-do-repeatable-sampling-in-bigquery-standard-sql