Random Sampling in Google BigQuery

Asked by 攒了一身酷 on 2020-11-29 21:48

I just discovered that the RAND() function, while undocumented, works in BigQuery. I was able to generate a (seemingly) random sample of 10 words from the Shakespeare dataset.
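
For reference, a query along these lines (a sketch of the approach, not necessarily the asker's exact SQL) produces that kind of sample from the public Shakespeare table:

    -- Sketch: pull 10 (pseudo-)random words from the public Shakespeare sample table
    SELECT word
    FROM `bigquery-public-data.samples.shakespeare`
    ORDER BY RAND()
    LIMIT 10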

4 Answers
  •  一生所求
    2020-11-29 22:41

    Once you have worked out what percentage of the total table you need, you have a couple of options.

    As mentioned before, one way is a non-deterministic sample (every time you run it you get a different sample) using RAND(). For example, if you want 0.1% of your table sampled, you would do:

    SELECT *
    FROM `dataset.table`
    WHERE RAND() < 0.001 
    

    You can make this deterministic by saving the result as a table so you can query it later (as in the sketch below); you could also select just one key column and save only that for future use.
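
    A minimal sketch of the first option (the table names are placeholders), which persists a one-off RAND() sample so later queries always see the same rows:

    -- Materialize the random sample once; querying dataset.table_sample later is deterministic
    CREATE TABLE `dataset.table_sample` AS
    SELECT *
    FROM `dataset.table`
    WHERE RAND() < 0.001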


    Another way, which gets you a repeatable random sample, is to use a hashing function, FARM_FINGERPRINT, to fingerprint your unique-identifier column and then select rows based on the last two digits of the fingerprint. The query below labels a random sample of 70% of the table; afterwards you can filter the result on in_sample = 'True' (see the join sketch after the query):

    SELECT
      *,
      IF(MOD(ABS(FARM_FINGERPRINT(CAST(YOUR_COLUMN AS STRING))), 100) < 70, 'True', 'False') AS in_sample
    FROM (
      SELECT
        DISTINCT(YOUR_UNIQUE_IDENTIFIER_COLUMN) AS YOUR_COLUMN
      FROM
        `dataset.table`)
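
    To then pull the sampled rows themselves, one sketch (assuming the labelled keys above were saved as `dataset.table_sample_keys`, a hypothetical name) joins the labels back to the full table and keeps only in_sample = 'True':

    -- Keep only rows whose key landed in the 70% sample
    SELECT t.*
    FROM `dataset.table` AS t
    JOIN `dataset.table_sample_keys` AS s
      ON t.YOUR_UNIQUE_IDENTIFIER_COLUMN = s.YOUR_COLUMN
    WHERE s.in_sample = 'True'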
    

    If you don't have a unique identifier column, you can concatenate several columns to make one, as in the sketch below.
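
    A sketch of that, assuming hypothetical columns col_a and col_b, fingerprints the concatenation directly:

    -- Build a synthetic key by concatenating columns, then fingerprint it as before
    SELECT
      *,
      IF(MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(col_a AS STRING), '|', CAST(col_b AS STRING)))), 100) < 70,
        'True', 'False') AS in_sample
    FROM
      `dataset.table`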


    A similar approach, but using the legacy-SQL HASH() function, is also repeatable and gets you 70% of the table. If you want a different percentage, just change the 7 to your desired value:

    SELECT
      *
    FROM
      [dataset.table]
    WHERE
      ABS(HASH(YOUR_COLUMN)) % 10 < 7
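
    Since HASH() and the % operator only exist in legacy SQL, a roughly equivalent standard-SQL filter (a sketch using the same placeholder names) would be:

    -- Standard SQL: same 70% filter, using FARM_FINGERPRINT and MOD
    SELECT
      *
    FROM
      `dataset.table`
    WHERE
      MOD(ABS(FARM_FINGERPRINT(CAST(YOUR_COLUMN AS STRING))), 10) < 7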
    

    I don't know about the scalability of FARM_FINGERPRINT vs. HASH, so I mentioned both; one may work better than the other depending on your use case.

    Best of luck,
