Random Sampling in Google BigQuery


攒了一身酷 2020-11-29 21:48

I just discovered that the RAND() function, while undocumented, works in BigQuery. I was able to generate a (seemingly) random sample of 10 words from the Shakespeare datase

4 Answers

野趣味 (original poster)

2020-11-29 22:32
For stratified sampling, check https://stackoverflow.com/a/52901452/132438

Good job finding it :). I requested the function recently, but it hasn't made it into the documentation yet.

I would say the advantage of RAND() is that the results will vary, while HASH() will keep giving you the same results for the same values (not guaranteed over time, but you get the idea).

If you want the variability that RAND() brings while still getting consistent results, you can seed it with an integer, as in RAND(3).
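To make the distinction concrete outside BigQuery, here is a minimal Python sketch. It is not BigQuery code: the function names and the word list are made up for illustration, SHA-256 merely stands in for HASH(), and Python's `random.Random(seed)` stands in for a seeded RAND().

```python
import hashlib
import random

# Hypothetical sample data, used only for this illustration.
WORDS = ["brave", "crown", "dagger", "exeunt", "forsooth", "ghost", "herald", "jest"]


def hash_pick(words, keep_fraction):
    """Keep a word when its hash lands in the lowest `keep_fraction` of the range.

    The decision depends only on the word itself, so every call agrees.
    """
    kept = []
    for word in words:
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
        if bucket < keep_fraction:
            kept.append(word)
    return kept


def seeded_pick(words, keep_fraction, seed):
    """Random selection that is repeatable as long as the seed stays the same."""
    rng = random.Random(seed)
    return [word for word in words if rng.random() < keep_fraction]


def unseeded_pick(words, keep_fraction):
    """Random selection that will generally differ from one call to the next."""
    return [word for word in words if random.random() < keep_fraction]
```

Calling `hash_pick(WORDS, 0.5)` or `seeded_pick(WORDS, 0.5, 3)` twice gives the same list both times, while `unseeded_pick(WORDS, 0.5)` is free to change between calls.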

Note, though, that the example you pasted does a full sort of the random values. For sufficiently big inputs this approach won't scale.

A scalable approach, to get around 10 random rows:
```
SELECT word
FROM [publicdata:samples.shakespeare]
WHERE RAND() < 10/164656
```
(where 10 is the approximate number of results I want to get, and 164656 the number of rows that table has)
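The trade-off between the two approaches can be simulated locally. The following is a rough Python sketch, not BigQuery code, and `filter_sample` and `sort_sample` are hypothetical names. The filter makes a single pass and keeps each row independently with probability k/n, so the number of rows returned is binomially distributed with mean k (here 10, with a standard deviation of about 3). The sort-based version returns exactly k rows but has to order every row first.

```python
import random


def filter_sample(rows, k, rng):
    """One pass: keep each row independently with probability k / len(rows)."""
    threshold = k / len(rows)
    return [row for row in rows if rng.random() < threshold]


def sort_sample(rows, k, rng):
    """Attach a random key to every row, sort on it, and keep the first k rows."""
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed[:k]]


if __name__ == "__main__":
    rng = random.Random(3)   # seeded only so that reruns print the same numbers
    rows = range(164656)     # stand-in for the table's rows
    sizes = [len(filter_sample(rows, 10, rng)) for _ in range(5)]
    print("filter sample sizes:", sizes)
    print("sort sample size:", len(sort_sample(rows, 10, rng)))
```

If exactly 10 rows are required, one option is to raise the threshold slightly so the filter over-samples, then trim the much smaller result.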

standardSQL update:
```
#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/164656
```
or even:
```
#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/(SELECT COUNT(*) FROM `publicdata.samples.shakespeare`)
```