How can I do stratified sampling on BigQuery?
For example, we want a 10% proportionate stratified sample using the category_id as the strata. We have up to 11000 cat
With #standardSQL, let's define our table and some stats over it:
WITH table AS (
  SELECT *, subreddit category
  FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
  SELECT *, SUM(c) OVER() total
  FROM (
    SELECT category, COUNT(*) c
    FROM table
    GROUP BY 1
    HAVING c>1000000)
)
In this setup:
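subreddit will be our category
only subreddits with more than 1,000,000 comments are kept (the HAVING clause)
c is the number of rows in each category, and total is the sum of c over the kept categories
So, if we want 1% of each category in our sample: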
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 1/100
)
GROUP BY 2
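RAND() draws a different sample on every run. If a repeatable sample is needed, one option is to hash a stable key instead of calling RAND(). The following is a minimal sketch, not part of the original approach: it assumes `id` is a unique, non-null string per comment, and uses BigQuery's built-in FARM_FINGERPRINT hash to keep roughly 1 row in 100.

```sql
WITH table AS (
  SELECT *, subreddit category
  FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
  SELECT *, SUM(c) OVER() total
  FROM (
    SELECT category, COUNT(*) c
    FROM table
    GROUP BY 1
    HAVING c>1000000)
)
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c
  FROM table a
  JOIN table_stats b
  USING(category)
  -- Assumption: id is a unique string key. The hash is a signed INT64,
  -- so testing for remainder 0 keeps about 1% of rows, the same rows on every run.
  WHERE MOD(FARM_FINGERPRINT(id), 100) = 0
)
GROUP BY 2
```

The trade-off is that the sample is only as random as the key is unrelated to the data being studied.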
Or let's say we want ~80,000 samples - but chosen proportionally through all categories:
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 80000/total
)
GROUP BY 2
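Since every row is kept with the same probability 80000/total, a category with c rows contributes about c*80000/total samples, so each category's share of the sample mirrors its share of the table. A quick sketch to preview those expected counts before sampling (the column name expected_samples is introduced here for illustration):

```sql
WITH table AS (
  SELECT *, subreddit category
  FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
  SELECT *, SUM(c) OVER() total
  FROM (
    SELECT category, COUNT(*) c
    FROM table
    GROUP BY 1
    HAVING c>1000000)
)
-- Expected number of sampled rows per category for a ~80,000-row sample.
-- The actual counts will vary around these values from run to run.
SELECT category, c, total, ROUND(80000*c/total) expected_samples
FROM table_stats
ORDER BY c DESC
```

The expected_samples column sums to roughly 80,000 across categories, up to rounding.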
Now, if you want to get the ~same number of samples from each group (let's say, 20,000):
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 20000/c
)
GROUP BY 2
If you want exactly 20,000 elements from each category:
SELECT ARRAY_LENGTH(cat_samples) samples, category, ROUND(100*ARRAY_LENGTH(cat_samples)/c,2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 20000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
)
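"Exactly" holds only for categories that have at least that many rows: a smaller category simply returns all of its rows. That cannot happen in the setup above, where every category has more than 1,000,000 rows, but it matters once the HAVING filter is relaxed. A small sketch on synthetic data (the 50-row and 500-row categories and the LIMIT of 100 are made up for illustration) shows the behaviour:

```sql
-- Synthetic data (an assumption for illustration): 50 'small' rows and 500 'large' rows.
WITH table AS (
  SELECT n AS id, IF(n <= 50, 'small', 'large') AS category
  FROM UNNEST(GENERATE_ARRAY(1, 550)) AS n
), table_stats AS (
  SELECT category, COUNT(*) c
  FROM table
  GROUP BY 1
)
-- Same query shape as above, with the per-category cap lowered to 100.
SELECT ARRAY_LENGTH(cat_samples) samples, category, ROUND(100*ARRAY_LENGTH(cat_samples)/c,2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 100) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
)
```

This should report 100 samples (20%) for 'large' and only 50 samples (100%) for 'small'.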
If you want exactly 2% of each group:
SELECT COUNT(*) samples, sample.category, ROUND(100*COUNT(*)/ANY_VALUE(c),2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND()) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c
GROUP BY 2
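One detail of the off<0.02*c filter: offsets start at 0, so when 0.02*c is not a whole number the sample is rounded up to the next row. A small sketch on synthetic data (the category sizes 1010 and 5000 are made up for illustration) makes this visible:

```sql
-- Synthetic data (an assumption for illustration): 1010 'small' rows and 5000 'large' rows.
WITH table AS (
  SELECT n AS id, IF(n <= 1010, 'small', 'large') AS category
  FROM UNNEST(GENERATE_ARRAY(1, 6010)) AS n
), table_stats AS (
  SELECT category, COUNT(*) c
  FROM table
  GROUP BY 1
)
-- Same query shape as above: shuffle each category, then keep offsets below 2% of its size.
SELECT COUNT(*) samples, sample.category, ROUND(100*COUNT(*)/ANY_VALUE(c),2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND()) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c
GROUP BY 2
```

For 'large', 0.02*5000 is 100, so offsets 0 to 99 give 100 samples (2%). For 'small', 0.02*1010 is 20.2, so offsets 0 to 20 give 21 samples (about 2.08%).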
If this last approach is what you want, you might notice it failing when you actually want to get the data out instead of just counting it. An early LIMIT of at least 2% of the largest group's size (105000 here) will make sure we don't sort more data than needed:
SELECT sample.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 105000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c
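The LIMIT has to cover 2% of the largest group, otherwise that group ends up with fewer than 2% of its rows. As far as I know, the LIMIT inside ARRAY_AGG must be a constant, so the value needs to be worked out beforehand. One way to find a lower bound, as a sketch (min_limit is a name introduced here for illustration):

```sql
WITH table AS (
  SELECT *, subreddit category
  FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
  SELECT *, SUM(c) OVER() total
  FROM (
    SELECT category, COUNT(*) c
    FROM table
    GROUP BY 1
    HAVING c>1000000)
)
-- Smallest LIMIT that still leaves 2% of the largest category in the array.
SELECT CAST(CEIL(0.02*MAX(c)) AS INT64) min_limit
FROM table_stats
```

Any constant at or above min_limit can then be used as the LIMIT in the query above.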