BigQuery: optimised query to get top 5 most visited wikipedia pages in each month

谁都会走 posted on 2020-01-14 04:07:06

Question


I am trying to get an optimised query to find the top 5 most visited wikipedia pages in each month in 2019 from the public dataset fh-bigquery.wikipedia_v3.pageviews_2019. I have come up with the below query but I need two things:

  1. The query runs for about 2 minutes and processes 2.3 TB, whether I filter to the top 5 ranks or display them all in the outer query. Is there a better-optimised query that processes less data in less time, for example by limiting the number of records fetched in the subquery itself (using top 5 or limit 5)?

  2. Sort the month based on calendar order rather than alphabetical order.

Query:

select * 
from (
  select Month_2019, title, tot_views,
    rank() over (partition by Month_2019 order by tot_views desc) as view_rank
  from (
    select format_date("%B", Date(datehour)) as Month_2019,
      title, sum(views) as tot_views
    from `fh-bigquery.wikipedia_v3.pageviews_2019`
    where wiki='en'
    and title not in ('Main_Page','-','Special:Search','Special:CreateAccount','Special:Watchlist','Special:ElectronPdf','Special:Book','Special:CiteThisPage','Special:RecentChanges','Portal:Current_events','Wikipedia')
    and datehour >= '2019-01-01' and datehour < '2020-01-01'  -- half-open range; BETWEEN ... '2019-12-31' stops at midnight and drops most of Dec 31
    group by Month_2019, title
  ))
where view_rank<6
order by 1,4

Expected Output:

January   Louis_Tomlinson             5075908  1
January   Deaths_in_2019              1832404  2
January   TCP_delayed_acknowledgment  1238559  3
January   Ted_Bundy                   1190672  4
January   Glass_(2019_film)           1018119  5
February  Louis_Tomlinson             5504517
February  Grover                      4970493
February  Rheology                    2852186
February  Deaths_in_2019
February  Operating_system
March.... ....


Answer 1:


To the question

The query runs for about 2 minutes and processes 2.3 TB irrespective of whether I filter top 5 ranks or display them all in the outer query. Is there a better optimised query to process less volume and in less time

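The query is already optimized! Note that it didn't process 2.3 TB, nor did it take 2 minutes: the actual run was 1 min 20 sec elapsed, with 440.1 GB processed. This is because the table is clustered; the 2.3 TB figure is presumably the estimate shown before the query runs, which for a clustered table is only an upper bound.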
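
One way to check the clustering claim is to look at the table's metadata. The following is a minimal sketch, assuming your account is allowed to read the INFORMATION_SCHEMA views of the public fh-bigquery.wikipedia_v3 dataset; it lists the columns the table is partitioned and clustered on:

```sql
-- Sketch: list partitioning and clustering columns for pageviews_2019.
-- Assumption: you have read access to this dataset's INFORMATION_SCHEMA views.
SELECT column_name,
  data_type,
  is_partitioning_column,
  clustering_ordinal_position
FROM `fh-bigquery.wikipedia_v3.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'pageviews_2019'
ORDER BY ordinal_position
```

Columns with a non-NULL clustering_ordinal_position are the clustering columns. If wiki turns out to be one of them, the wiki='en' filter lets BigQuery skip whole blocks, which would account for the billed bytes being far below the estimate.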

I can improve the running time of the query with ARRAY_AGG, and a better sort:

select Month_2019,
  ARRAY_AGG(STRUCT(title, tot_views) ORDER BY tot_views DESC LIMIT 5) as top_pages
from (
  select format_date("%B", Date(datehour)) as Month_2019,
    title, sum(views) as tot_views, MIN(datehour) month_for_sorting
  from `fh-bigquery.wikipedia_v3.pageviews_2019`
  where wiki='en'
  and title not in ('Main_Page','-','Special:Search','Special:CreateAccount','Special:Watchlist','Special:ElectronPdf','Special:Book','Special:CiteThisPage','Special:RecentChanges','Portal:Current_events','Wikipedia')
  -- half-open range: BETWEEN ... '2019-12-31' would stop at midnight and drop most of Dec 31
  and datehour >= '2019-01-01' and datehour < '2020-01-01'
  group by Month_2019, title
)
GROUP BY 1
order by MIN(month_for_sorting)
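
The ARRAY_AGG version returns one row per month with a nested array, whereas the expected output in the question has one row per page with a rank. A possible way to get back to that shape is to unnest the array with its offset. This is only a sketch: the names top_pages, page, pos and the subquery alias m are illustrative and not part of the original answer.

```sql
-- Sketch: flatten the per-month top-5 array into one row per page.
-- top_pages, page, pos and m are hypothetical names chosen for this example.
SELECT m.Month_2019,
  page.title,
  page.tot_views,
  pos + 1 AS view_rank                 -- WITH OFFSET is 0-based
FROM (
  SELECT Month_2019,
    MIN(month_for_sorting) AS month_for_sorting,
    ARRAY_AGG(STRUCT(title, tot_views) ORDER BY tot_views DESC LIMIT 5) AS top_pages
  FROM (
    SELECT FORMAT_DATE("%B", DATE(datehour)) AS Month_2019,
      title, SUM(views) AS tot_views, MIN(datehour) AS month_for_sorting
    FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
    WHERE wiki = 'en'
      AND title NOT IN ('Main_Page','-','Special:Search','Special:CreateAccount','Special:Watchlist','Special:ElectronPdf','Special:Book','Special:CiteThisPage','Special:RecentChanges','Portal:Current_events','Wikipedia')
      -- half-open range so that all of Dec 31 is included
      AND datehour >= '2019-01-01' AND datehour < '2020-01-01'
    GROUP BY Month_2019, title
  )
  GROUP BY Month_2019
) AS m, UNNEST(m.top_pages) AS page WITH OFFSET AS pos
ORDER BY m.month_for_sorting, pos      -- calendar order, then rank within the month
```

One difference from the rank() query in the question: LIMIT 5 always keeps exactly five pages per month, while rank() can return more than five rows when totals tie.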

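And if we drop all rows with 3 or fewer views (WHERE views > 3), the query runs in only 12 seconds. The totals then slightly undercount each page, but that is unlikely to change which pages reach the top 5: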

select Month_2019,
  ARRAY_AGG(STRUCT(title, tot_views) ORDER BY tot_views DESC LIMIT 5) as top_pages
from (
  select format_date("%B", Date(datehour)) as Month_2019,
    title, sum(views) as tot_views, MIN(datehour) month_for_sorting
  from `fh-bigquery.wikipedia_v3.pageviews_2019`
  where wiki='en'
  AND views > 3  -- skip hourly rows with 3 or fewer views
  and title not in ('Main_Page','-','Special:Search','Special:CreateAccount','Special:Watchlist','Special:ElectronPdf','Special:Book','Special:CiteThisPage','Special:RecentChanges','Portal:Current_events','Wikipedia')
  -- half-open range: BETWEEN ... '2019-12-31' would stop at midnight and drop most of Dec 31
  and datehour >= '2019-01-01' and datehour < '2020-01-01'
  group by Month_2019, title
)
GROUP BY 1
order by MIN(month_for_sorting)

Note that each time we only processed 440GB.

If you're planning to keep playing with queries like this, extract the interesting rows (e.g., filter out the low-view rows) into a new table, so that each query scans even fewer GB.

For example:

CREATE TABLE `fh-bigquery.wikipedia_extracts.2019_en_m_daily`
PARTITION BY date
CLUSTER BY title
AS
SELECT DATE(TIMESTAMP_TRUNC(datehour, DAY)) date, SUM(views) views, title
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki='en.m'
AND title not in ('Main_Page','-','Portal:Current_events','Wikipedia')
AND NOT title LIKE 'Special:%'
GROUP BY date, title  -- group by the daily date alias; this query has no month column
HAVING views > 100


# 33.1 sec elapsed, 420.7 GB processed

Note that I switched wiki='en' to wiki='en.m' for mobile results. Queries against the new table now process only 2.5 GB:

SELECT month
  -- SUBSTR positions are 1-based; keep the first 21 characters of each title
  , ARRAY_AGG(STRUCT(SUBSTR(title,1,21) AS title, views) ORDER BY views DESC LIMIT 5) AS top_pages
from (
  SELECT DATE_TRUNC(date, MONTH) month, title, SUM(views) as views
  FROM `fh-bigquery.wikipedia_extracts.2019_en_m_daily`
  WHERE title NOT LIKE 'File%'
  GROUP BY month, title
)
GROUP BY 1
ORDER BY month
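
Because the extract table is partitioned by date, a filter on date should restrict the scan to the matching partitions only. As a sketch (the March range and the alias month_views are just illustrative choices, and it assumes the extract table above is readable by you), the top 5 mobile pages for a single month could be fetched like this:

```sql
-- Sketch: top 5 pages for one month, scanning only that month's partitions.
-- month_views is a hypothetical alias; the date range is an arbitrary example.
SELECT title, SUM(views) AS month_views
FROM `fh-bigquery.wikipedia_extracts.2019_en_m_daily`
WHERE date BETWEEN '2019-03-01' AND '2019-03-31'  -- DATE column, so BETWEEN covers whole days
  AND title NOT LIKE 'File%'
GROUP BY title
ORDER BY month_views DESC
LIMIT 5
```

Here a plain LIMIT 5 is enough because there is only one group; the ARRAY_AGG ... LIMIT 5 form is what makes "top 5 per month" possible in a single pass over all months.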



Source: https://stackoverflow.com/questions/59462279/bigquery-optimised-query-to-get-top-5-most-visited-wikipedia-pages-in-each-mont
