Question
The biggest chunk of my BigQuery billing comes from query consumption. I am trying to optimize this by understanding which datasets/tables consume the most.
I am therefore looking for a way to track my BigQuery usage, ideally something closer to real time (so that I don't have to wait a day for the final results). The ideal output would be, for instance, how much each table/dataset consumed in the last hour.
So far I have managed to find the Monitoring dashboard, but it only displays the queries in flight per project and the stored bytes per table, which is not what I am after.
What other solutions are there to retrieve this kind of information?
Answer 1:
Using Stackdriver logs, you could create a sink with a Pub/Sub topic as target for real-time analysis, with a filter that keeps only BigQuery job-completion logs, like this:
resource.type="bigquery_resource" AND
proto_payload.method_name="jobservice.jobcompleted" AND
proto_payload.service_data.job_completed_event.job.job_statistics.total_billed_bytes:*
(see example queries here: https://cloud.google.com/logging/docs/view/query-library?hl=en_US#bigquery-filters)
You could create the sink on a specific project, a folder or even an organization. This will retrieve all the queries done in BigQuery in that specific project, folder or organization.
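For instance, here is a minimal sketch of creating such a project-level sink with gcloud, reusing the filter above (the sink name my-bq-billing-sink is a placeholder, and the Pub/Sub topic is assumed to already exist):

gcloud logging sinks create my-bq-billing-sink \
  pubsub.googleapis.com/projects/PROJECT_ID/topics/TOPIC_ID \
  --log-filter='resource.type="bigquery_resource" AND proto_payload.method_name="jobservice.jobcompleted"'

The command prints the sink's writer identity (a service account), which you then need to grant the Pub/Sub Publisher role on the topic.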
The field proto_payload.service_data.job_completed_event.job.job_statistics.total_billed_bytes will give you the number of bytes billed for the query.
Based on on-demand BigQuery pricing (as of now, $5/TB for most regions, but check the price for your own region), you could easily estimate the billing in real time. You could create a Dataflow job that aggregates the results into BigQuery, or simply consume the destination Pub/Sub topic with any job you want to perform the pricing calculation:
jobPriceInUSD = totalBilledBytes / 1_000_000_000_000 * pricePerTB
because 1 TB = 1_000_000_000_000 B. As said before, pricePerTB depends on the region (see https://cloud.google.com/bigquery/pricing#on_demand_pricing for the exact price). For example, as of time of writing:
- $5/TB for us-east1
- $6/TB for asia-northeast1
- $9/TB for southamerica-east1
Also, as of now, the first 1 TB processed each month is free.
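As a quick worked example with the us-east1 price above (the 2.5 TB job size is just an illustrative assumption):

jobPriceInUSD = 2_500_000_000_000 / 1_000_000_000_000 * 5 = 12.50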
Answer 2:
It might be easier to use the INFORMATION_SCHEMA.JOBS_BY_* views, because you don't have to set up Stackdriver logging and can use them right away.
Example taken & modified from How to monitor query costs in Google BigQuery
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;

SELECT
  ROUND(SUM(total_bytes_processed) / gb_divisor, 2) AS bytes_processed_in_gb,
  -- cache hits are free, so only non-cached bytes count toward cost
  ROUND(SUM(IF(cache_hit != true, total_bytes_processed, 0)) * cost_factor, 4) AS cost_in_dollar,
  user_email
FROM (
  (SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
  UNION ALL
  (SELECT * FROM `other-project.region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
)
WHERE
  DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE()
GROUP BY
  user_email
Some caveats:
- you need to UNION ALL all of the projects that you use explicitly
- JOBS_BY_USER did not work for me on my private account (supposedly because my login email is @googlemail and BigQuery stores my email as @gmail)
- the WHERE condition needs to be adjusted to your billing period (instead of the last 30 days)
- the views don't provide the "bytes billed" information, so it has to be estimated from total_bytes_processed and the cache usage
- the query doesn't include the "if less than 10MB, bill 10MB" minimum
- data is only retained for the past 180 days
- DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5; reflects only US costs; other regions might have different costs (see https://cloud.google.com/bigquery/pricing#on_demand_pricing)
- you can only query one region at a time
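Coming back to the original per-table question, here is a minimal sketch along the same lines, assuming the JOBS_BY_PROJECT view and its referenced_tables column are available in your region (it reuses the $5/TB rate and the 1024-based divisors from the query above; note that a job referencing several tables has its full bytes billed attributed to each of them, so per-table totals can overlap):

SELECT
  ref.project_id,
  ref.dataset_id,
  ref.table_id,
  -- bytes billed by queries that touched this table in the last hour
  ROUND(SUM(total_bytes_billed) / (1024*1024*1024), 2) AS gb_billed,
  ROUND(SUM(total_bytes_billed) / (1024*1024*1024*1024) * 5, 4) AS approx_cost_in_dollar
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(referenced_tables) AS ref
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND job_type = 'QUERY'
GROUP BY
  ref.project_id, ref.dataset_id, ref.table_id
ORDER BY
  gb_billed DESC;

JOBS_BY_PROJECT requires permission to list all jobs in the project; JOBS_BY_USER should work the same way if you only want your own jobs.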
Source: https://stackoverflow.com/questions/61477907/how-can-i-monitor-incurred-bigquery-billings-costs-jobs-completed-by-table-dat