BigQuery Integer Partitions - can I use the results of another query to get a list of the partitions to access?

Submitted by 走远了吗 on 2020-04-12 06:38:25

Question


I have a large table (~1 TB) that uses integer-range partitioning. I regularly need to build several small subsets of this table. This used to cost a lot, but by filtering on the integer partitions I can cut the cost by roughly 95%. The filters look something like this:

tbl_a : partition_index IN (1, 2, 5, 6, 7, 10, 11, 15, 104, 106, 111)

tbl_b : partition_index IN (3, 4, 5, 20, 21, 25, 16, 84, 201, 301, 302, 303)

and so on and so forth, with different subtables using different subsets of the index. It's ugly as all hell, but it works. I'm concerned this will be difficult to maintain: if I need to make a new subtable, or the potential permutations change, I have to edit all the .sql files with new sets of index values. I have a small table that holds all the different permutations of the criteria I want, along with the associated index value. A 5 KB query on this index lookup table, using the actual subtable selection criteria, yields a list of index values that, if copied and pasted straight into the .sql files, keeps everything working properly.

However, for architectural reasons, I cannot extract the index values from a subquery and insert them as a string into the .sql files prior to execution. I mean, I could, and it would work, but it's hacky and not a reasonable solution. What I can't find is a way to use the results of the small query on the lookup table directly: it always results in a full table scan. Any ideas here?

I guess an equivalent problem would be if I had a big data table partitioned on customerID, but I only had the customer name. BQ seems to want me to query the name lookup table to get the ID, then submit a second query with the customerID as a string literal. I'd like to be able to do this in a single query. But I'm stumped.


Answer 1:


Let me reproduce your problem.

SELECT MAX(views) max_views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
WHERE DATE(datehour) IN ('2019-03-27', '2019-04-10', '2019-05-10', '2019-10-10')
AND wiki='en'
AND title = 'Barbapapa'

1.4 GB processed.

But now you have a table with those dates:

CREATE TABLE temp.some_dates AS (
  SELECT * 
  FROM UNNEST([DATE('2019-03-27'), '2019-04-10', '2019-05-10', '2019-10-10']) date
);

And now we will run a query that takes the values out of that table:

SELECT MAX(views) max_views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
WHERE DATE(datehour) IN (SELECT * FROM temp.some_dates)
AND wiki='en'
AND title = 'Barbapapa'

1.4 GB processed.

No problem here: the same amount of data was processed! Why? Because this table is clustered. Cluster your tables.

  • https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b

But let's look at v2 of that table, where things are not clustered:

SELECT MAX(views) max_views
FROM `fh-bigquery.wikipedia_v2.pageviews_2019` 
WHERE DATE(datehour) IN ('2019-03-27', '2019-04-10', '2019-05-10', '2019-10-10')
AND wiki='en'
AND title = 'Barbapapa'

26.5 GB processed. That's a lot more than 1.4 GB. If only I had clustered this table.

And if we get the dates out of a different table?

SELECT MAX(views) max_views
FROM `fh-bigquery.wikipedia_v2.pageviews_2019` 
WHERE DATE(datehour) IN (SELECT * FROM `temp.some_dates`)
AND wiki='en'
AND title = 'Barbapapa'

2.3 TB processed.

Wow, that was a really big table scan. I should have clustered my tables.

But can I fix this somehow?

Yes:

DECLARE some_dates ARRAY<DATE> DEFAULT (SELECT ARRAY_AGG(date) FROM `temp.some_dates`);


SELECT MAX(views) max_views
FROM `fh-bigquery.wikipedia_v2.pageviews_2019` 
WHERE DATE(datehour) IN UNNEST(some_dates)
AND wiki='en'
AND title = 'Barbapapa'

26.46 GB processed.

Not as good as a clustered table, but at least we used the partitioning, thanks to a script run inside BigQuery: first declare a variable, then use its contents.
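The same declare-then-use pattern should carry over to the integer-partitioned table from the question. What follows is only a sketch under assumptions: `mydataset.index_lookup`, `mydataset.big_table`, `mydataset.tbl_a` and the `segment` column are hypothetical names, since the question does not show the real schema. Check the bytes processed on your own table to confirm the partitions are actually being pruned.

```sql
-- Hypothetical names throughout: adjust dataset, table and column names.
-- DECLARE must come first in the script. The lookup query runs once
-- and its result is stored in the variable.
DECLARE wanted_partitions ARRAY<INT64> DEFAULT (
  SELECT ARRAY_AGG(DISTINCT partition_index)
  FROM `mydataset.index_lookup`
  WHERE segment = 'tbl_a'   -- the real subtable selection criteria go here
);

-- By the time this statement is planned the variable holds concrete values,
-- which is what allowed the date-partitioned example above to prune.
CREATE OR REPLACE TABLE `mydataset.tbl_a` AS
SELECT *
FROM `mydataset.big_table`
WHERE partition_index IN UNNEST(wanted_partitions);
```

Both statements are submitted together as a single script, so nothing has to be pasted into the .sql files as a string literal.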

Still, my best advice is: Cluster your tables.

  • https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
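
For an integer-partitioned table like the one in the question, a clustered copy could be created along these lines. This is a sketch with assumed names: `mydataset.big_table`, `mydataset.big_table_clustered`, and the clustering columns `region` and `category` are hypothetical stand-ins for whatever columns the selection criteria actually filter on, and the 0 to 400 range is only a guess based on the index values shown in the question.

```sql
-- Hypothetical names and ranges: adjust to the real schema.
-- One-time rebuild of the table, partitioned by the integer index
-- and clustered by the columns that the queries filter on.
CREATE TABLE `mydataset.big_table_clustered`
PARTITION BY RANGE_BUCKET(partition_index, GENERATE_ARRAY(0, 400, 1))
CLUSTER BY region, category
AS
SELECT *
FROM `mydataset.big_table`;
```

Note that the copy itself reads the whole source table once, so the savings only show up on the queries that follow.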


Source: https://stackoverflow.com/questions/60894346/bigquery-integer-partitions-can-i-use-the-results-of-another-query-to-get-a-li
