Stratified random sampling with BigQuery?


How can I do stratified sampling on BigQuery?

For example, we want a 10% proportionate stratified sample using the category_id as the strata. We have up to 11000 cat

2 answers
  •  陌清茗 (original poster)
    2020-12-08 18:02

    I think the simplest way to get a proportionate stratified sample is to order the data by category and take every "nth" row. For a 10% sample, you want every 10th row.

    This looks like:

    select t.*
    from (select t.*,
                 row_number() over (order by category, rand()) as seqnum
          from t
         ) t
    where mod(seqnum, 10) = 1;
    

    Note: This does not guarantee that all categories will be in the final sample. A category with fewer than 10 rows may not appear.
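
    If every category has to appear in the sample, one possible variant (a sketch, not part of the method above) numbers the rows within each category and keeps the first 10% of each, rounded up. The column alias cnt and the round-up rule are assumptions made for illustration; t and category are the same placeholders used above.

    -- Per-category 10% sample that keeps at least one row per category.
    select * except (seqnum, cnt)
    from (select t.*,
                 -- random position of the row inside its own category
                 row_number() over (partition by category order by rand()) as seqnum,
                 -- total number of rows in that category
                 count(*) over (partition by category) as cnt
          from t
         ) t
    where seqnum <= ceil(cnt * 0.1);

    Because of the rounding up, small categories are slightly over-represented, so the overall sample can be somewhat larger than 10%.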

    If you want equal-sized samples, then order the rows randomly within each category and just take a fixed number:

    select t.*
    from (select t.*,
                 row_number() over (partition by category order by rand()) as seqnum
          from t
         ) t
    where seqnum <= 100;
    

    Note: This does not guarantee that 100 rows exist within each category. It takes all rows for smaller categories and a random sample of larger ones.
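
    To see in advance how many rows each category would contribute under this rule, a quick check along these lines may help. It is only a sketch; the aliases total_rows and sampled_rows are made up for illustration.

    -- Expected per-category sample size for a cap of 100 rows.
    select category,
           count(*) as total_rows,
           least(count(*), 100) as sampled_rows
    from t
    group by category
    order by total_rows;

    Categories listed first with sampled_rows below 100 are the ones that are taken in full.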

    Both of these methods are quite handy. They can work with multiple dimensions at the same time: list all of the stratifying columns in the order by (first method) or the partition by (second method). The first has the particularly nice feature that it also works with numeric dimensions.
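
    As a rough illustration of the "nth row" approach with more than one dimension, the query below stratifies on two columns at once. The column region is hypothetical and only stands in for a second stratifying dimension.

    -- 10% sample stratified by category and (hypothetical) region.
    -- mod() is used because BigQuery standard SQL has no % operator.
    select * except (seqnum)
    from (select t.*,
                 row_number() over (order by category, region, rand()) as seqnum
          from t
         ) t
    where mod(seqnum, 10) = 1;

    For a numeric dimension, the same pattern can order by the numeric column itself (for example a hypothetical price column, as in order by price, rand()), so that every 10th row is picked along the sorted values and the sample is spread across the whole range.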
