How can I do stratified sampling on BigQuery?
For example, we want a 10% proportionate stratified sample using the category_id as the strata. We have up to 11000 cat
I think the simplest way to get a proportionate stratified sample is to order the data by the categories and do an "nth" sample of the data. For a 10% sample, you want every 10 rows.
This looks like:
select t.*
from (select t.*,
row_number() over (order by category order by rand()) as seqnum
from t
) t
where seqnum % 10 = 1;
Note: This does not guarantee that all categories will be in the final sample. A category with fewer than 10 rows may not appear.
If you want equal sized samples, then order within each category and just take a fixed number:
select t.*
from (select t.*,
row_number() over (partition by category order by rand()) as seqnum
from t
) t
where seqnum <= 100;
Note: This does not guarantee that 100 rows exist within each category. It takes all rows for smaller categories and a random sample of larger ones.
Both these methods are quite handy. They can work with multiple dimensions at the same time. The first has a particularly nice feature that it can also work with numeric dimensions as well.