BigQuery - querying only a subset of keys in a table with key value schema

Question

So I have a table with the following schema:

timestamp: TIMESTAMP
key: STRING
value: FLOAT

There are around 200 unique keys. The table is partitioned by date.

I want to run several (5-6 currently, but I expect to add at least 15 more) queries on a daily basis on this database. Brute forcing these would cost me a lot daily, which I want to avoid.

The issue is that, because of this key-value format and BigQuery being a columnar database, each query scans the whole day's data, even though each query actually uses at most 4 keys. What is the best way to optimize this?

The best approach I can think of right now is to create separate temp tables for each key in a daily batch process, run my queries on them, and then delete them.

Ideally I would partition by key, but I am not sure whether BigQuery offers anything like that.

Answer 1:

You can try the recently introduced clustering for partitioned tables:

When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.

Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.

Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
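Applied to the table in the question, this could look roughly like the sketch below. It only builds BigQuery Standard SQL strings and does not contact BigQuery. The table name `mydataset.kv_clustered`, the helper names, and the `AVG` aggregation are made up for illustration; the columns come from the question's schema (with FLOAT written as FLOAT64, its Standard SQL name).

```python
from datetime import date, timedelta

# Hypothetical table name, used only for illustration.
TABLE = "mydataset.kv_clustered"


def create_table_ddl(table=TABLE):
    """DDL for a table partitioned by day and clustered by `key`."""
    return (
        f"CREATE TABLE `{table}` (\n"
        "  `timestamp` TIMESTAMP,\n"
        "  `key` STRING,\n"
        "  `value` FLOAT64\n"
        ")\n"
        "PARTITION BY DATE(`timestamp`)\n"
        "CLUSTER BY `key`"
    )


def daily_query(keys, day, table=TABLE):
    """Query one day of data for a handful of keys.

    The timestamp range limits the scan to one partition; the IN filter
    on the clustering column lets BigQuery skip blocks for other keys.
    """
    next_day = day + timedelta(days=1)
    # Keys are assumed to be trusted; single quotes are escaped defensively.
    quoted = ", ".join("'" + k.replace("'", "\\'") + "'" for k in keys)
    return (
        "SELECT `key`, AVG(`value`) AS avg_value\n"
        f"FROM `{table}`\n"
        f"WHERE `timestamp` >= TIMESTAMP('{day.isoformat()}')\n"
        f"  AND `timestamp` < TIMESTAMP('{next_day.isoformat()}')\n"
        f"  AND `key` IN ({quoted})\n"
        "GROUP BY `key`"
    )


if __name__ == "__main__":
    print(create_table_ddl())
    print(daily_query(["cpu", "mem"], date(2018, 7, 30)))
```

Each of the daily queries would then filter on its own 3-4 keys, so there is no need for per-key temp tables.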

Update (moved from comments)

Also keep the following in mind:

Feature          Partitioning   Clustering
---------------  -------------  -------------
Cardinality      Less than 10k  Unlimited    
Dry Run Pricing  Available      Not available    
Query Pricing    Exact          Best Effort

Pay special attention to Dry Run Pricing. Unfortunately, a dry run (validation) on a clustered table does not take the clustering keys into account; it shows only the estimate based on partitions. But if you set up your clustering properly, the actual run will end up costing less. You should try this with a smaller dataset first to get comfortable with it.
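One way to get comfortable with this is to compare the dry-run estimate with what a real run is billed. The sketch below assumes the `google-cloud-bigquery` Python client is installed and default credentials are configured; `estimate_and_run` and `savings_pct` are hypothetical helper names, not part of any library.

```python
def estimate_and_run(sql):
    """Return (dry_run_bytes, billed_bytes) for `sql`.

    Runs the query for real, so it incurs cost.
    """
    # Imported here so the helper below works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()

    # Dry run: nothing is executed; the estimate reflects partition pruning.
    dry_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_job = client.query(sql, job_config=dry_config)
    estimate = dry_job.total_bytes_processed

    # Real run: bytes billed also reflect block pruning from clustering.
    run_config = bigquery.QueryJobConfig(use_query_cache=False)
    job = client.query(sql, job_config=run_config)
    job.result()  # wait for the query to finish
    return estimate, job.total_bytes_billed


def savings_pct(estimate, billed):
    """Percentage by which the billed bytes undercut the dry-run estimate."""
    if estimate <= 0:
        return 0.0
    return 100.0 * (estimate - billed) / estimate


if __name__ == "__main__":
    # Example numbers only, not measured results.
    print(savings_pct(1000, 250))
```

On very small tables the result can be zero or negative, because billed bytes are rounded up to a minimum, so test with enough data for the difference to show.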

See more at Clustering partitioned tables

Source: https://stackoverflow.com/questions/51594068/bigquery-querying-only-a-subset-of-keys-in-a-table-with-key-value-schema

Tags

google-cloud-platform

google-bigquery