This may be a dumb question, but I cannot seem to be able to run the Python google-cloud-bigquery client asynchronously. My goal is to run multiple queries concurrently and wait for them all to finish.
Just to share a different solution:
from google.cloud import bigquery
import numpy as np
from time import sleep

bq = bigquery.Client()

query1 = """
SELECT
  language.name,
  AVG(language.bytes)
FROM `bigquery-public-data.github_repos.languages`
, UNNEST(language) AS language
GROUP BY language.name"""
query2 = 'SELECT 2'

def dummy_callback(future):
    # BigQuery jobs are futures, so the callback receives the job itself
    global jobs_done
    jobs_done[future.job_id] = True

# bq.query() returns immediately with a running job, so both queries start concurrently
jobs = [bq.query(query1), bq.query(query2)]
jobs_done = {job.job_id: False for job in jobs}
for job in jobs:
    job.add_done_callback(dummy_callback)

# blocking loop to wait for jobs to finish
while not np.all(list(jobs_done.values())):
    print('waiting for jobs to finish ... sleeping for 1s')
    sleep(1)
print('all jobs done, do your stuff')
Rather than using as_completed (from concurrent.futures), I prefer to use the built-in async functionality of the BigQuery jobs themselves. This also makes it possible to decompose the data pipeline into separate Cloud Functions, without having to keep the main ThreadPoolExecutor alive for the duration of the whole pipeline. Incidentally, this was the reason I was looking into this: my pipelines run longer than the maximum timeout of 9 minutes for Cloud Functions (or even 15 minutes for Cloud Run).
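For completeness, here is a minimal sketch (not part of the original answer) of how a later Cloud Function could pick the jobs back up by id; it assumes the job ids and the jobs' location are handed over from the previous stage:

from google.cloud import bigquery

bq = bigquery.Client()

def jobs_finished(job_ids, location='US'):
    """Re-fetch each job by id and report whether all of them are done."""
    for job_id in job_ids:
        job = bq.get_job(job_id, location=location)  # rehydrate the job from its id
        if job.error_result:  # surface failures instead of waiting forever
            raise RuntimeError(f'job {job_id} failed: {job.error_result}')
        if job.state != 'DONE':
            return False
    return True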
The downside is that I need to keep track of all the job_ids across the various functions, but that is relatively easy to solve when configuring the pipeline by specifying inputs and outputs such that they form a directed acyclic graph.
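As a toy illustration (the stage names are made up), such a configuration can be a plain mapping from each stage to the stages it depends on; a stage becomes runnable once all of its inputs' jobs are done:

pipeline = {
    'stage_a': {'inputs': []},
    'stage_b': {'inputs': ['stage_a']},
    'stage_c': {'inputs': ['stage_a', 'stage_b']},
}

def ready_stages(done):
    # done is the set of stage names whose BigQuery jobs have finished
    return [stage for stage, cfg in pipeline.items()
            if stage not in done and all(dep in done for dep in cfg['inputs'])]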