How can I write a query that makes histogram graph rendering easier?
For example, we have 100 million people with ages, we want to draw the histogram/buckets for age 0
With #standardSQL and an auxiliary stats query, we can define the range the histogram should look into.
Here for the time to fly between SFO and JFK - with 10 buckets:
WITH data AS (
  SELECT *, ActualElapsedTime datapoint
  FROM `fh-bigquery.flights.ontime_201903`
  WHERE FlightDate_year = "2018-01-01"
  AND Origin = 'SFO' AND Dest = 'JFK'
)
, stats AS (
  -- One row per bucket, each covering [min, max).
  -- GENERATE_ARRAY(0, 10, 1) yields 11 values, so the last range starts at
  -- the overall maximum and catches the rows equal to it.
  SELECT min+step*i min, min+step*(i+1) max
  FROM (
    SELECT max-min diff, min, max, (max-min)/10 step, GENERATE_ARRAY(0, 10, 1) i
    FROM (
      SELECT MIN(datapoint) min, MAX(datapoint) max
      FROM data
    )
  ), UNNEST(i) i
)
-- Count the datapoints per bucket and label each bucket by its midpoint.
SELECT COUNT(*) count, (min+max)/2 avg
FROM data
JOIN stats
ON data.datapoint >= stats.min AND data.datapoint<stats.max
GROUP BY avg
ORDER BY avg
If you need round numbers, see: https://stackoverflow.com/a/60159876/132438
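The bucket arithmetic in the stats query above can be checked locally before running it on real data. The following is only a rough Python sketch of the same logic, not the BigQuery implementation: the function names and the sample values are made up for illustration.

```python
# Sketch of the bucket arithmetic in the stats CTE; names and data are hypothetical.
def bucket_ranges(lo, hi, buckets=10):
    """Return [start, end) ranges the way the stats CTE builds them."""
    step = (hi - lo) / buckets
    # GENERATE_ARRAY(0, 10, 1) yields 0..10 inclusive, hence buckets + 1 ranges.
    return [(lo + step * i, lo + step * (i + 1)) for i in range(buckets + 1)]


def histogram(datapoints, buckets=10):
    """Count datapoints per range, keyed by the range midpoint."""
    ranges = bucket_ranges(min(datapoints), max(datapoints), buckets)
    counts = {}
    for start, end in ranges:
        n = sum(1 for d in datapoints if start <= d < end)
        if n:  # like the inner JOIN, ranges without rows are dropped
            counts[(start + end) / 2] = n
    return counts


sample = [300, 310, 320, 350, 400]  # made-up elapsed times in minutes
result = histogram(sample)
```

With this sample the maximum (400) falls into the extra range that starts at the maximum, which is why the query generates eleven ranges for ten steps.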
Write a subquery like this:
(SELECT '1' AS agegroup, count(*) AS count FROM people WHERE AGE >= 0 AND AGE <= 10)
Then you can combine several of them (in legacy SQL, the comma between subqueries acts as UNION ALL). Keep the ranges from overlapping, otherwise a boundary age such as 10 is counted twice:
SELECT * FROM
(SELECT '1' AS agegroup, count(*) AS count FROM people WHERE AGE >= 0 AND AGE <= 10),
(SELECT '2' AS agegroup, count(*) AS count FROM people WHERE AGE > 10 AND AGE <= 20),
(SELECT '3' AS agegroup, count(*) AS count FROM people WHERE AGE > 20 AND AGE <= 120)
Result will be like:
Row  agegroup  count
1    1         somenumber
2    2         somenumber
3    3         another number
I hope this helps you. Of course, the agegroup label can be any string you like, for example '0 to 10'.
See the 2019 update, with #standardSQL --Fh
The subquery idea works, as does "CASE WHEN" and then doing a group by:
SELECT SUM(field1), bucket
FROM (
  SELECT field1, CASE WHEN age >= 0 AND age < 10 THEN 1
                      WHEN age >= 10 AND age < 20 THEN 2
                      WHEN age >= 20 AND age < 30 THEN 3
                      ...
                      ELSE -1 END as bucket
  FROM table1)
GROUP BY bucket
Alternately, if the buckets are regular -- you could just divide and cast to an integer:
SELECT SUM(field1), bucket
FROM (
SELECT field1, INTEGER(age / 10) as bucket FROM table1)
GROUP BY bucket
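As a quick sanity check of the divide-and-cast idea, here is a small Python sketch, assuming non-negative ages. The rows are invented, and `age // 10` stands in for INTEGER(age / 10).

```python
from collections import defaultdict

# Hypothetical (age, field1) rows standing in for table1.
rows = [(3, 1.0), (9, 2.0), (10, 4.0), (25, 8.0)]

totals = defaultdict(float)
for age, field1 in rows:
    bucket = age // 10  # for non-negative ages: 0-9 -> 0, 10-19 -> 1, ...
    totals[bucket] += field1  # mirrors SUM(field1) ... GROUP BY bucket
```

Buckets with no rows (here bucket 3 and above) simply do not appear, just as in the grouped query.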
Using a cross join to get your min and max values (not that expensive on a single tuple) you can get a normalized bucket list of any given bucket count:
select
  min(data.VAL) as min,
  max(data.VAL) as max,
  count(data.VAL) as num,
  integer((data.VAL-value.min)/(value.max-value.min)*8) as bucket
from [table] data
CROSS JOIN (SELECT MAX(VAL) as max, MIN(VAL) as min from [table]) value
GROUP BY bucket
ORDER BY bucket
In this example we're getting 8 buckets, numbered 0 to 7, plus a bucket 8 that holds only the rows where VAL equals the maximum, plus one for null VAL.
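To see how the normalized index behaves at the edges, the expression can be mirrored in a few lines of Python. This is a sketch under assumptions: `bucket_index` and `clamped_index` are hypothetical helper names, the values are invented, and the clamping variant is an optional tweak that is not part of the query above.

```python
def bucket_index(val, lo, hi, buckets=8):
    """Mirror integer((VAL - min) / (max - min) * 8)."""
    return int((val - lo) / (hi - lo) * buckets)


def clamped_index(val, lo, hi, buckets=8):
    """Optional variant: fold the maximum into the last regular bucket."""
    return min(bucket_index(val, lo, hi, buckets), buckets - 1)


vals = [0, 1, 5, 9, 10]  # made-up values
lo, hi = min(vals), max(vals)
indices = [bucket_index(v, lo, hi) for v in vals]
clamped = [clamped_index(v, lo, hi) for v in vals]
```

The maximum value maps to index 8 in the plain version and to index 7 in the clamped one.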
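There is now the APPROX_QUANTILES aggregation function in standard SQL. It returns an array of number_of_bins + 1 approximate boundaries, so the bins hold roughly equal numbers of rows rather than spanning equal widths.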
SELECT
APPROX_QUANTILES(column, number_of_bins)
...
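For intuition about what such quantile boundaries look like, here is a Python sketch using the standard library's `statistics.quantiles`. It computes exact cut points on a small in-memory list, so treat it only as an analogy to the approximate BigQuery function. `quantile_boundaries` is a hypothetical helper and the data is made up.

```python
import statistics


def quantile_boundaries(values, bins):
    """Return bins + 1 boundaries: minimum, the inner cut points, maximum."""
    cuts = statistics.quantiles(values, n=bins, method="inclusive")
    return [min(values)] + cuts + [max(values)]


ages = list(range(0, 101))  # made-up ages 0..100
edges = quantile_boundaries(ages, 4)
```

Each pair of adjacent boundaries then delimits one bin containing about the same number of values.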
You're looking for a single vector of information. I would normally query it like this:
select
count(*) as num,
integer( age / 10 ) as age_group
from mytable
group by age_group
A big case statement will be needed for arbitrary groups. It would be simple but much longer. My example should be fine if every bucket contains N years.
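For arbitrary groups, the lookup that a long CASE statement performs can be sketched in Python with a sorted list of lower bounds. The edges and ages below are invented for illustration, and this shows the idea only, not something BigQuery runs.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical lower bounds: [0, 18), [18, 30), [30, 65), and 65 or older.
edges = [0, 18, 30, 65]


def age_group(age):
    """Index of the range containing age; -1 if age is below the first edge."""
    return bisect_right(edges, age) - 1


ages = [5, 17, 18, 29, 40, 70]  # made-up sample
counts = Counter(age_group(a) for a in ages)
```

Each entry in `edges` corresponds to one WHEN branch, and the -1 result plays the role of the ELSE branch.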