Question
I am trying to get an optimised query to find the top 5 most visited Wikipedia pages in each month of 2019 from the public dataset fh-bigquery.wikipedia_v3.pageviews_2019. I have come up with the query below, but I need two things:
1. The query runs for about 2 minutes and processes 2.3 TB irrespective of whether I filter the top 5 ranks or display them all in the outer query. Is there a better-optimised query that processes less data in less time, for example by limiting the number of records fetched in the first place (using top 5 or limit 5 in the subquery itself)?
2. Sort the months in calendar order rather than alphabetical order.
Query:
select *
from (
select Month_2019, title, tot_views,
rank() over (partition by Month_2019 order by tot_views desc) as view_rank
from (
select format_date("%B", Date(datehour)) as Month_2019,
title, sum(views) as tot_views
from `fh-bigquery.wikipedia_v3.pageviews_2019`
where wiki='en'
and title not in ('Main_Page','-','Special:Search','Special:CreateAccount','Special:Watchlist','Special:ElectronPdf','Special:Book','Special:CiteThisPage','Special:RecentChanges','Portal:Current_events','Wikipedia')
and datehour between '2019-01-01' and '2019-12-31'
group by Month_2019, title
))
where view_rank<6
order by 1,4
Expected Output:
January Louis_Tomlinson 5075908 1
January Deaths_in_2019 1832404 2
January TCP_delayed_acknowledgment 1238559 3
January Ted_Bundy 1190672 4
January Glass_(2019_film) 1018119 5
February Louis_Tomlinson 5504517
February Grover 4970493
February Rheology 2852186
February Deaths_in_2019
February Operating_system
March.... ....
Answer 1:
To the question
The query runs for about 2 minutes and processes 2.3 TB irrespective of whether I filter top 5 ranks or display them all in the outer query. Is there a better optimised query to process less volume and in less time
The query is already optimized! Note that it didn't process 2.3 TB, nor did it take 2 minutes. When I ran it, the result was 1 min 20 sec elapsed, 440.1 GB processed. This is because the table is clustered.
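The gap between 2.3 TB and 440.1 GB can be checked against your own job history. One plausible explanation (an assumption, the answer does not state it) is that 2.3 TB was the estimate shown before the query ran; for clustered tables that estimate is an upper bound, and the bytes actually processed are only known once the job finishes. A minimal sketch, assuming your jobs ran in the US multi-region (adjust `region-us` otherwise):

```sql
-- Sketch: list your own recent query jobs with actual bytes processed and elapsed time.
-- Assumption: jobs ran in the US multi-region; change `region-us` if they did not.
SELECT
  job_id,
  creation_time,
  ROUND(total_bytes_processed / POW(10, 9), 1) AS gb_processed,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS elapsed_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC
LIMIT 10
```

Comparing `gb_processed` here with the estimate shown in the console should tell you how much clustering saved for a given query.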
I can improve the running time of the query with ARRAY_AGG, and a better sort:
select Month_2019, ARRAY_AGG(STRUCT(title, tot_views) ORDER BY tot_views DESC LIMIT 5)
from (
select format_date("%B", Date(datehour)) as Month_2019,
title, sum(views) as tot_views, MIN(datehour) month_for_sorting
from `fh-bigquery.wikipedia_v3.pageviews_2019`
where wiki='en'
and title not in ('Main_Page','-','Special:Search','Special:CreateAccount','Special:Watchlist','Special:ElectronPdf','Special:Book','Special:CiteThisPage','Special:RecentChanges','Portal:Current_events','Wikipedia')
and datehour between '2019-01-01' and '2019-12-31'
group by Month_2019, title
)
GROUP BY 1
order by MIN(month_for_sorting)
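The query above returns one row per month with an array of five structs. If you want the flat, one-row-per-page layout from the question's expected output, the array can be unnested again. This is only a sketch built on the same query; the names `top_pages`, `m`, `page` and `pos` are introduced here for illustration, and the date filter is written as a half-open range so that every hour of December 31 is included:

```sql
-- Sketch: flatten the per-month top-5 arrays back into rows with an explicit rank.
SELECT
  m.Month_2019,
  page.title,
  page.tot_views,
  pos + 1 AS view_rank  -- WITH OFFSET is 0-based, so add 1
FROM (
  SELECT
    Month_2019,
    ARRAY_AGG(STRUCT(title, tot_views) ORDER BY tot_views DESC LIMIT 5) AS top_pages,
    MIN(month_for_sorting) AS month_for_sorting  -- earliest hour seen in that month
  FROM (
    SELECT
      FORMAT_DATE("%B", DATE(datehour)) AS Month_2019,
      title,
      SUM(views) AS tot_views,
      MIN(datehour) AS month_for_sorting
    FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
    WHERE wiki = 'en'
      AND title NOT IN ('Main_Page','-','Special:Search','Special:CreateAccount','Special:Watchlist','Special:ElectronPdf','Special:Book','Special:CiteThisPage','Special:RecentChanges','Portal:Current_events','Wikipedia')
      AND datehour >= '2019-01-01' AND datehour < '2020-01-01'
    GROUP BY Month_2019, title
  )
  GROUP BY Month_2019
) AS m,
UNNEST(m.top_pages) AS page WITH OFFSET AS pos
ORDER BY m.month_for_sorting, pos
```

Unlike RANK(), the offset never produces ties: two pages with identical totals get consecutive positions in whatever order ARRAY_AGG emitted them.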
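And if we keep only rows WHERE views > 3, then the query runs in only 12 seconds. This drops low-traffic hourly rows before aggregation, so the totals come out slightly lower, but pages that reach a monthly top 5 are barely affected: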
select Month_2019, ARRAY_AGG(STRUCT(title, tot_views) ORDER BY tot_views DESC LIMIT 5)
from (
select format_date("%B", Date(datehour)) as Month_2019,
title, sum(views) as tot_views, MIN(datehour) month_for_sorting
from `fh-bigquery.wikipedia_v3.pageviews_2019`
where wiki='en'
AND views > 3
and title not in ('Main_Page','-','Special:Search','Special:CreateAccount','Special:Watchlist','Special:ElectronPdf','Special:Book','Special:CiteThisPage','Special:RecentChanges','Portal:Current_events','Wikipedia')
and datehour between '2019-01-01' and '2019-12-31'
group by Month_2019, title
)
GROUP BY 1
order by MIN(month_for_sorting)
Note that each time we only processed 440GB.
If you're planning to continue playing with queries like this, extract all of the interesting rows (for example, by filtering out rows with very few views) to a new table, so that each query scans even fewer GB.
For example:
CREATE TABLE `fh-bigquery.wikipedia_extracts.2019_en_m_daily`
PARTITION BY date
CLUSTER BY title
AS
SELECT DATE(TIMESTAMP_TRUNC(datehour, DAY)) date, SUM(views) views, title
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki='en.m'
AND title not in ('Main_Page','-','Portal:Current_events','Wikipedia')
AND NOT title LIKE 'Special:%'
GROUP BY date, title
HAVING views > 100
# 33.1 sec elapsed, 420.7 GB processed
Note I switched wiki=en to wiki=en.m for mobile results. And now queries only process 2.5 GB:
SELECT month
, ARRAY_AGG(STRUCT(SUBSTR(title,0,21), views) ORDER BY views DESC LIMIT 5)
from (
SELECT DATE_TRUNC(date, MONTH) month, title, SUM(views) as views
FROM `fh-bigquery.wikipedia_extracts.2019_en_m_daily`
WHERE title NOT LIKE 'File%'
GROUP BY month, title
)
GROUP BY 1
ORDER BY month
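Because the extract table is partitioned by date, a query that filters on that column should scan only the matching partitions, so a single month costs less again. A minimal sketch for March 2019, assuming the extract table was created as shown above (the alias `total_views` is introduced here for illustration):

```sql
-- Sketch: top 5 mobile pages for one month, reading only that month's partitions.
SELECT
  title,
  SUM(views) AS total_views
FROM `fh-bigquery.wikipedia_extracts.2019_en_m_daily`
WHERE date BETWEEN '2019-03-01' AND '2019-03-31'  -- filter on the partitioning column
  AND title NOT LIKE 'File%'
GROUP BY title
ORDER BY total_views DESC
LIMIT 5
```

The same pattern works for any date range; the narrower the filter on `date`, the fewer partitions are read.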
Source: https://stackoverflow.com/questions/59462279/bigquery-optimised-query-to-get-top-5-most-visited-wikipedia-pages-in-each-mont