Create ranges based on the data

问题

I have products sold at different prices. I want to see how many products were sold in a particular price range. For this, I need to go through the data and see how to divide the ranges and then get the count of products in that range.

Data looks like below-

Product        Price sold
   A              4.5
   B              45.7
   C              20
   D              20.1
   E              36.8
   F              50

For example, for the above data I see that the min is 4.5 and max is 50. So, I decide to divide the price range like- 0-10$, 11-20$, 21-30$, 30-40$, 40-50$

So, the result should look like-

Range           No. of products sold
0-10                    1
11-20                   2
21-30                   0
30-40                   1
40-50                   2

The prices are in float so, the ranges should take care of the float values. Is this possible?

回答1:

Below is for BigQuery Standard SQL

#standardSQL
WITH price_ranges AS (
  SELECT '0-10' price_range UNION ALL 
  SELECT '11-20' UNION ALL 
  SELECT '21-30' UNION ALL 
  SELECT '30-40' UNION ALL 
  SELECT '40-50' 
)
SELECT price_range, COUNT(1) number_sold
FROM `project.dataset.table`
JOIN price_ranges 
ON CAST(price_sold AS INT64)
  BETWEEN CAST(SPLIT(price_range, '-')[OFFSET(0)] AS INT64) 
  AND CAST(SPLIT(price_range, '-')[OFFSET(1)] AS INT64)
GROUP BY price_range
-- ORDER BY price_range

If to apply to sample data from your question - result is

Row price_range number_sold  
1   0-10        1    
2   11-20       2    
3   30-40       1    
4   40-50       2

回答2:

You can use generate_array(). I would phrase this as:

select lb, lb + 10 as ub, count(d.product)
from unnest(generate_array(0, 50, 10)) lb left join
     data d
     on d.price >= lb and
        d.price < lb + 10
group by lb
order by lb;

You can concatenate the lower bound and upper bound together, but it seems just a useful to keep them in two columns.

回答3:

None of the current answers seem to answer the question: "How do I generate the range" (as both answers assume a range 0-50).

What you seem to want is an histogram, and you can find that answer here:

https://stackoverflow.com/a/56447260/132438

Now, if you want round steps between each bucket:


WITH data AS (
  SELECT * FROM  `fh-bigquery.public_dump.gdp_capita`
), min_and_max AS (
  SELECT MIN(gdp_capita) min, MAX(gdp_capita) max
  FROM data
), generate_buckets AS (
  SELECT x bucket_min
    , IFNULL(LEAD(x) OVER(ORDER BY x), 1+(SELECT max FROM min_and_max)) bucket_max 
  FROM UNNEST(generate_array(
    (SELECT 0 FROM min_and_max) # min or 0, depending on your start
    , (SELECT max FROM min_and_max)
    , (SELECT POW(10, fhoffa.x.int(LOG10(max-min)))/10 FROM min_and_max) # log10 for round order of 10 steps
  )) x
)

SELECT *
FROM generate_buckets

With those buckets, you can get an histogram now:

SELECT bucket_min, bucket_max, COUNT(*) c
FROM generate_buckets
JOIN data
ON data.gdp_capita >= bucket_min AND data.gdp_capita < bucket_max 
GROUP BY 1,2 
ORDER BY 1

If you also need the buckets with 0 elements:

SELECT * REPLACE(IFNULL(c,0) AS c)
FROM (
  SELECT bucket_min, bucket_max, COUNT(*) c
  FROM generate_buckets
  JOIN data
  ON data.gdp_capita >= bucket_min AND data.gdp_capita < bucket_max AND data.one=generate_buckets.one
  GROUP BY 1,2 
  ORDER BY 1
) 
RIGHT JOIN generate_buckets USING(bucket_min, bucket_max)

来源：https://stackoverflow.com/questions/60157499/create-ranges-based-on-the-data

标签

sql

google-bigquery