Use Google BigQuery to build a histogram graph

再見小時候 2020-12-17 22:11

How can I write a query that makes rendering a histogram graph easier?

For example, we have 100 million people with ages, and we want to draw the histogram/buckets for age 0

7 Answers
  • 2020-12-17 22:26

    With #standardSQL and an auxiliary stats query, we can define the range the histogram should look into.

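    Here is an example for the flight time between SFO and JFK, with 10 equal-width buckets (GENERATE_ARRAY(0, 10, 1) yields 11 values, so there is also an 11th range that holds only the maximum value):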

    WITH data AS ( 
        SELECT *, ActualElapsedTime datapoint
        FROM `fh-bigquery.flights.ontime_201903`
        WHERE FlightDate_year = "2018-01-01" 
        AND Origin = 'SFO' AND Dest = 'JFK'
    )
    , stats AS (
      SELECT min+step*i min, min+step*(i+1)max
      FROM (
        SELECT max-min diff, min, max, (max-min)/10 step, GENERATE_ARRAY(0, 10, 1) i
        FROM (
          SELECT MIN(datapoint) min, MAX(datapoint) max
          FROM data
        )
      ), UNNEST(i) i
    )
    
    SELECT COUNT(*) count, (min+max)/2 avg
    FROM data 
    JOIN stats
    ON data.datapoint >= stats.min AND data.datapoint<stats.max
    GROUP BY avg
    ORDER BY avg
    

    If you need round numbers, see: https://stackoverflow.com/a/60159876/132438
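
The min/step arithmetic of the stats subquery can be checked outside BigQuery. This is a minimal Python sketch, not the query itself: the function name and the sample data are made up for illustration, and unlike the query above it clamps the maximum into the last bucket instead of giving it a range of its own.

```python
def equal_width_histogram(values, buckets=10):
    """Count values in `buckets` equal-width ranges between min and max."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        # Index of the range containing v; the maximum is clamped into the
        # last bucket instead of getting a bucket of its own.
        i = min(int((v - lo) / step), buckets - 1) if step else 0
        counts[i] += 1
    # (bucket midpoint, count) pairs, like the avg/count columns of the query.
    return [(lo + step * (i + 0.5), c) for i, c in enumerate(counts)]


minutes = [300, 310, 315, 320, 330, 345, 350, 360, 380, 400]  # made-up data
for midpoint, count in equal_width_histogram(minutes, buckets=5):
    print(f"{midpoint:6.1f} {count}")
```

Each pair can be fed straight to a bar chart, with the midpoint as the x position and the count as the bar height.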

  • 2020-12-17 22:32

    Write a subquery like this:

    (SELECT '1' AS agegroup, count(*) AS count FROM people WHERE AGE >= 0 AND AGE <= 10)
    

    Then you can combine several of them; in legacy SQL, a comma between subqueries acts as UNION ALL. The ranges must not overlap, otherwise a boundary age such as 10 is counted twice:

    SELECT * FROM
    (SELECT '1' AS agegroup, count(*) AS count FROM people WHERE AGE >= 0 AND AGE <= 10),
    (SELECT '2' AS agegroup, count(*) AS count FROM people WHERE AGE > 10 AND AGE <= 20),
    (SELECT '3' AS agegroup, count(*) AS count FROM people WHERE AGE > 20 AND AGE <= 120)
    

    Result will be like:

    Row agegroup count 
    1   1       somenumber
    2   2       somenumber
    3   3       another number
    

    I hope this helps. Of course, the agegroup label can be any string, such as '0 to 10'.

  • 2020-12-17 22:33

    See the 2019 update, with #standardSQL --Fh


    The subquery idea works, as does "CASE WHEN" and then doing a group by:

    SELECT SUM(field1), bucket 
    FROM (
        SELECT field1, CASE WHEN age >=  0 AND age < 10 THEN 1
                            WHEN age >= 10 AND age < 20 THEN 2
                            WHEN age >= 20 AND age < 30 THEN 3
                            ...
                            ELSE -1 END as bucket
        FROM table1) 
    GROUP BY bucket
    

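    Alternatively, if the buckets are regular, you could just divide and cast to an integer (INTEGER() is the legacy SQL cast; in standard SQL, DIV(age, 10) does integer division when age is an integer):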

    SELECT SUM(field1), bucket 
    FROM (
        SELECT field1, INTEGER(age / 10) as bucket FROM table1)
    GROUP BY bucket
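
To try both variants without a BigQuery project, here is a sketch using Python's built-in sqlite3 module. SQLite is only a stand-in, the rows in table1 are made up, and plain integer division (`age / 10` on an INTEGER column) replaces the legacy `INTEGER()` cast, so treat it as an illustration of the grouping logic rather than BigQuery syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (age INTEGER, field1 INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(3, 1), (9, 1), (10, 1), (15, 1), (27, 1), (29, 1), (41, 1)],  # made-up rows
)

# Variant 1: explicit boundaries with CASE WHEN; anything unmatched goes to -1.
case_when = conn.execute("""
    SELECT bucket, SUM(field1)
    FROM (
        SELECT field1, CASE WHEN age >=  0 AND age < 10 THEN 1
                            WHEN age >= 10 AND age < 20 THEN 2
                            WHEN age >= 20 AND age < 30 THEN 3
                            ELSE -1 END AS bucket
        FROM table1)
    GROUP BY bucket
    ORDER BY bucket
""").fetchall()

# Variant 2: regular buckets; SQLite truncates integer / integer.
divided = conn.execute("""
    SELECT age / 10 AS bucket, SUM(field1)
    FROM table1
    GROUP BY bucket
    ORDER BY bucket
""").fetchall()

print(case_when)
print(divided)
```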
    
  • 2020-12-17 22:46

    Using a cross join to get your min and max values (not that expensive on a single tuple) you can get a normalized bucket list of any given bucket count:

    select
      min(data.VAL) as min,
      max(data.VAL) as max,
      count(data.VAL) as num,
      integer((data.VAL-value.min)/(value.max-value.min)*8) as bucket
    from [table] data
    CROSS JOIN (SELECT MAX(VAL) as max, MIN(VAL) as min from [table]) value
    GROUP BY bucket
    ORDER BY bucket
    

    In this example we get 8 buckets (indices 0 to 7), plus index 8 for rows equal to the maximum, and one more group for NULL VAL.

  • 2020-12-17 22:48

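    There is now the APPROX_QUANTILES aggregate function in standard SQL. It returns an array of number_of_bins + 1 approximate boundaries (minimum, inner quantiles, maximum), which gives equal-frequency rather than equal-width bins: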

    SELECT
        APPROX_QUANTILES(column, number_of_bins)
    ...    
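
To see what such a boundary array looks like, here is a small Python sketch using the standard library's `statistics.quantiles`. It computes exact quantiles on made-up data, so it only mimics the shape of the result and is not how BigQuery approximates them.

```python
import statistics

ages = list(range(1, 101))  # made-up data: ages 1..100
number_of_bins = 4

# statistics.quantiles returns only the inner cut points, so the minimum and
# maximum are added to mimic an array of number_of_bins + 1 boundaries.
inner = statistics.quantiles(ages, n=number_of_bins, method="inclusive")
boundaries = [min(ages)] + inner + [max(ages)]

print(boundaries)
```

Each pair of adjacent boundaries delimits a bin holding roughly the same number of rows, so the bars of such a histogram differ in width rather than in height.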
    
  • 2020-12-17 22:52

    You're looking for a single vector of information. I would normally query it like this:

    select
      count(*) as num,
      integer( age / 10 ) as age_group
    from mytable
    group by age_group 
    

    A big case statement will be needed for arbitrary groups. It would be simple but much longer. My example should be fine if every bucket contains N years.
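
One thing to watch when rendering: a GROUP BY only returns buckets that contain at least one row, so empty age groups are missing from the result. A possible way to fill them in on the client side is sketched below in Python; the helper name and the sample rows are hypothetical, and it assumes no NULL age_group values.

```python
def dense_series(rows, width=10):
    """Turn sparse (age_group, num) rows into labeled bars without gaps."""
    counts = dict(rows)
    first, last = min(counts), max(counts)
    return [
        # Label each bucket by its inclusive age range, e.g. "30-39".
        (f"{g * width}-{g * width + width - 1}", counts.get(g, 0))
        for g in range(first, last + 1)
    ]


rows = [(0, 2), (1, 2), (2, 2), (4, 1)]  # made-up query result, group 3 is empty
print(dense_series(rows))
```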
