Get the top patent countries, codes in a BQ public dataset

陌路散爱 提交于 2020-01-11 14:13:21

问题


I am trying to use an analytic function to get the top 2 countries with patent applications, and within those top 2 countries, get the top 2 application kinds. For example, the answer will look something like this:

country  -   code 
US           P
US           A
GB           X
GB           P

Here is the query I am using to get this:

SELECT
  country_code,
  MIN(count_country_code) count_country_code,
  application_kind
FROM (
  WITH
    A AS (
    SELECT
      country_code,
      COUNT(country_code) OVER (PARTITION BY country_code) AS count_country_code,
      application_kind
    FROM
      `patents-public-data.patents.publications`),
    B AS (
    SELECT
      country_code,
      count_country_code,
      DENSE_RANK() OVER(ORDER BY count_country_code DESC) AS country_code_num,
      application_kind,
      DENSE_RANK() OVER(PARTITION BY country_code ORDER BY count_country_code DESC) AS application_kind_num
    FROM
      A)
  SELECT
    country_code,
    count_country_code,
    application_kind
  FROM
    B
  WHERE
    country_code_num <= 2
    AND application_kind_num <= 2) x
GROUP BY
  country_code,
  application_kind
ORDER BY
  count_country_code DESC

However, unfortunately, I get a "memory exceeded" error due to the over/order/partition. Here is the message:

Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 112% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%

How would I go about doing the above query (or other similar queries) without running into these memory errors? This can be tested on the public dataset here.

One crude way to do it (which only works if the fields have a semi-low cardinality), would be to do it as a straightforward aggregation operation and sort the results in-memory outside the DB. For example:


回答1:


Below is for BigQuery Standard SQL

#standardSQL
WITH A AS (
  SELECT country_code
  FROM `patents-public-data.patents.publications`
  GROUP BY country_code
  ORDER BY COUNT(1) DESC
  LIMIT 2
), B AS (
  SELECT
    country_code,
    application_kind,
    COUNT(1) application_kind_count
  FROM `patents-public-data.patents.publications`
  WHERE country_code IN (SELECT country_code FROM A)
  GROUP BY country_code, application_kind
), C AS (
  SELECT
    country_code,
    application_kind,
    application_kind_count,
    DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
  FROM B
)
SELECT
  country_code,
  application_kind,
  application_kind_count
FROM C
WHERE application_kind_rank <= 2  

with result



来源:https://stackoverflow.com/questions/56283851/get-the-top-patent-countries-codes-in-a-bq-public-dataset

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!