Question
I am trying to fetch results from BigQueryOperator using Airflow, but I could not find a way to do it. I tried calling the next() method on the bq_cursor member (available in 1.10), but it returns None. This is how I tried to do it:
import datetime
import logging

from airflow import models
from airflow.contrib.operators import bigquery_operator
from airflow.operators import python_operator

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time()
)

def MyChequer(**kwargs):
    big_query_count = bigquery_operator.BigQueryOperator(
        task_id='my_bq_query',
        sql='select count(*) from mydataset.mytable'
    )
    big_query_count.execute(context=kwargs)

    logging.info(big_query_count)
    logging.info(big_query_count.__dict__)
    logging.info(big_query_count.bq_cursor.next())

default_dag_args = {
    'start_date': yesterday,
    'email_on_failure': False,
    'email_on_retry': False,
    'project_id': 'myproject'
}

with models.DAG(
        'bigquery_results_execution',
        # Continue to run DAG once per day
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    myoperator = python_operator.PythonOperator(
        task_id='threshold_operator',
        provide_context=True,
        python_callable=MyChequer
    )

    # Define DAG
    myoperator
Taking a look at bigquery_hook.py and bigquery_operator.py, it seems to be the only available way to fetch the results.
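For reference, a minimal sketch of that hook-based approach inside a PythonOperator callable might look like the following (this assumes the default bigquery_default connection, and mydataset.mytable is the same illustrative table as above):

import logging

from airflow.contrib.hooks.bigquery_hook import BigQueryHook

def my_chequer(**kwargs):
    # Open a cursor through the hook instead of instantiating an operator
    hook = BigQueryHook(bigquery_conn_id='bigquery_default',
                        use_legacy_sql=False)
    cursor = hook.get_conn().cursor()
    cursor.execute('select count(*) from mydataset.mytable')
    # fetchone() returns the first row of the result set as a list
    row = cursor.fetchone()
    logging.info('Row count: %s', row[0])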
Answer 1:
I create my own operator using the BigQuery hook whenever I need to get data from a BigQuery query and use it for something. I usually call this a BigQueryToXOperator, and we have a bunch of these for sending BigQuery data to other internal systems.
For example, I have a BigQueryToPubSub operator that you might find useful as an example of how to query BigQuery and then handle the results on a row-by-row basis, sending them to Google PubSub. Consider the following generalized sample code for how to do this on your own:
from json import dumps

from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class BigQueryToXOperator(BaseOperator):

    template_fields = ['sql']
    ui_color = '#000000'

    @apply_defaults
    def __init__(
            self,
            sql,
            keys,
            bigquery_conn_id='bigquery_default',
            delegate_to=None,
            *args,
            **kwargs):
        super(BigQueryToXOperator, self).__init__(*args, **kwargs)
        self.sql = sql
        self.keys = keys  # A list of keys for the columns in the result set of sql
        self.bigquery_conn_id = bigquery_conn_id
        self.delegate_to = delegate_to

    def execute(self, context):
        """
        Run query and handle results row by row.
        """
        cursor = self._query_bigquery()
        for row in cursor.fetchall():
            # Zip keys and row together because the cursor returns a list of lists (not a list of dicts)
            row_dict = dumps(dict(zip(self.keys, row))).encode('utf-8')

            # Do what you want with the row...
            handle_row(row_dict)

    def _query_bigquery(self):
        """
        Queries BigQuery and returns a cursor to the results.
        """
        bq = BigQueryHook(bigquery_conn_id=self.bigquery_conn_id,
                          use_legacy_sql=False)
        conn = bq.get_conn()
        cursor = conn.cursor()
        cursor.execute(self.sql)
        return cursor
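As a usage sketch (the task id, SQL, and column names here are illustrative, and handle_row above is the placeholder you would replace with your own logic), the operator can then be wired into a DAG like any other:

bq_to_x = BigQueryToXOperator(
    task_id='bq_to_x',
    sql='SELECT name, value FROM mydataset.mytable',
    keys=['name', 'value'],  # Must match the column order of the query
    dag=dag
)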
Answer 2:
You can use BigQueryOperator to save results in a temporary destination table, then use BigQueryGetDataOperator to fetch the results as below, and finally use BigQueryTableDeleteOperator to delete the table:
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results='100',
    selected_fields='DATE',
    bigquery_conn_id='airflow-service-account'
)
Docs:
- BigQueryGetDataOperator: https://airflow.readthedocs.io/en/1.10.0/integration.html#bigquerygetdataoperator
- BigQueryTableDeleteOperator: https://airflow.readthedocs.io/en/1.10.0/integration.html#bigquerytabledeleteoperator
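Tying the three steps together, a sketch of the full pattern might look like this (the dataset, table, SQL, and connection names are illustrative, and it reuses the get_data task above):

from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_table_delete_operator import (
    BigQueryTableDeleteOperator
)

# Step 1: materialize the query results into a temporary destination table
save_results = BigQueryOperator(
    task_id='save_results_to_bq',
    sql='select count(*) as cnt from mydataset.mytable',
    destination_dataset_table='test_dataset.Transaction_partitions',
    write_disposition='WRITE_TRUNCATE',
    use_legacy_sql=False,
    bigquery_conn_id='airflow-service-account'
)

# Step 3: drop the temporary table once the data has been fetched
delete_table = BigQueryTableDeleteOperator(
    task_id='delete_bq_table',
    deletion_dataset_table='test_dataset.Transaction_partitions',
    bigquery_conn_id='airflow-service-account'
)

save_results >> get_data >> delete_table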
Answer 3:
Thanks to @kaxil and @Mike for their answers. I found the problem. There is a kind of bug (in my mind) in the BigQueryCursor. As part of run_with_configuration, the running_job_id is returned but never assigned to job_id, which is used to pull the results in the next method. A workaround (not really elegant, but good if you do not want to re-implement everything) is to assign the job_id based on the running_job_id in the hook, like this:
big_query_count.execute(context=kwargs)

# Workaround: point the cursor's job_id at the job that actually ran
big_query_count.bq_cursor.job_id = big_query_count.bq_cursor.running_job_id

logging.info(big_query_count.bq_cursor.next())
Once the problem in run_with_configuration is fixed so that it assigns the correct job_id at the end of the process, the workaround line can be removed.
Source: https://stackoverflow.com/questions/53565834/fetch-results-from-bigqueryoperator-in-airflow