Question:
I am trying to create tasks dynamically based on the response of a database call, but when I do this the option to run the DAG simply doesn't appear in Airflow, so I can't trigger it.
Here's the code:
# Imports (assumed; not shown in the original snippet)
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators import python_operator
from airflow.operators.dummy_operator import DummyOperator

tables = ['a', 'b', 'c']  # This works
# tables = get_tables()   # This never works

check_x = python_operator.PythonOperator(
    task_id="verify_loaded",
    python_callable=lambda: verify_loaded(tables)
)

bridge = DummyOperator(
    task_id='bridge'
)

check_x >> bridge

for vname in tables:
    sql = "SELECT * FROM `asd.temp.{table}` LIMIT 5".format(table=vname)
    log.info(vname)
    materialize__bq = BigQueryOperator(
        sql=sql,
        destination_dataset_table="asd.temp." + table_prefix + vname,
        task_id="materialize_" + vname,
        bigquery_conn_id="bigquery_default",
        google_cloud_storage_conn_id="google_cloud_default",
        use_legacy_sql=False,
        write_disposition="WRITE_TRUNCATE",
        create_disposition="CREATE_IF_NEEDED",
        query_params={},
        allow_large_results=True
    )
    bridge >> materialize__bq

def get_tables():
    bq_hook = BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=False)
    my_query = "SELECT table_id FROM `{project}.{dataset}.{table}` LIMIT 3;".format(
        project=project, dataset=dataset, table='__TABLES__')
    df = bq_hook.get_pandas_df(sql=my_query, dialect='standard')
    return df['table_id'].tolist()  # was `return view_names`, which is undefined; return the fetched table names
I am trying to make the commented-out part work, but with no luck. The get_tables() function fetches table names from BigQuery, and I wanted to generate the tasks dynamically from its result. When I do this, I don't get the option to run the DAG and it looks like the DAG is broken. Any help? I have been trying for a long time.
Here is a screenshot:
Answer 1:
To understand the problem we must look at the Composer architecture:
https://cloud.google.com/composer/docs/concepts/overview
The scheduler runs in GKE using the service account configured when you created the Composer instance.
The web UI runs in a tenant project on App Engine, using a different service account. The resources of this tenant project are hidden (you don't see the App Engine application, the Cloud SQL instance or the service account among the project resources).
When the web UI parses the DAG file, it tries to access BigQuery using the connection 'bigquery_default'.
Checking the Airflow GCP _get_credentials source code: https://github.com/apache/airflow/blob/1.10.2/airflow/contrib/hooks/gcp_api_base_hook.py#L74
If you have not configured the connection in the Airflow admin, it uses the google.auth.default method, which connects to BigQuery with the tenant project's service account. That service account does not have permission to access BigQuery, so the call fails with an unauthorized error and the web UI is not able to render the DAG. If you check Stackdriver, you will probably find the BigQuery error.
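For context, a simplified paraphrase of that fallback (not the verbatim Airflow source, just the shape of the 1.10.x hook) looks like this:

import google.auth
from google.oauth2 import service_account

def _get_credentials(key_path=None, keyfile_dict=None, scopes=None):
    # A key file explicitly configured on the connection is used first.
    if key_path:
        return service_account.Credentials.from_service_account_file(key_path, scopes=scopes)
    if keyfile_dict:
        return service_account.Credentials.from_service_account_info(keyfile_dict, scopes=scopes)
    # Nothing configured: fall back to Application Default Credentials.
    # In the Composer web UI this resolves to the tenant project's
    # service account, which cannot query BigQuery.
    credentials, _ = google.auth.default(scopes=scopes)
    return credentials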
On the other side, the Airflow scheduler uses the service account chosen at Composer creation, which has the right permissions, so it parses the DAG correctly.
If you execute the code in a local Airflow instance, the web UI and the scheduler use the same service account, so it works as expected in both cases.
The easiest solution is to add a Keyfile Path or Keyfile JSON to the bigquery_default connection, so the web UI stops falling back to the default service account.
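A minimal sketch of that change, assuming you keep a service-account key file somewhere the environment can read; the path, project id and the ORM approach are illustrative, and the same edit can be made by hand under Admin > Connections in the Airflow UI:

import json

from airflow import settings
from airflow.models import Connection

session = settings.Session()
conn = session.query(Connection).filter(Connection.conn_id == "bigquery_default").one()

# Point the connection at an explicit key so the web UI no longer falls
# back to the tenant project's default credentials.
extra = json.loads(conn.extra or "{}")
extra["extra__google_cloud_platform__key_path"] = "/home/airflow/gcs/data/keyfile.json"  # placeholder path
extra["extra__google_cloud_platform__project"] = "asd"  # project used in the question
conn.extra = json.dumps(extra)

session.commit()
session.close()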
If you have any security concern with this solution (the service account credentials will be available to anyone with access to Composer), another option is to restructure the code so that everything runs inside a single PythonOperator. That PythonOperator would call get_tables and then loop over the results, executing the BigQuery commands with a BigQueryHook instead of a BigQueryOperator. The drawback of this solution is that you end up with a single task instead of one task per table.
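A minimal sketch of that restructuring, assuming the get_tables, verify_loaded and table_prefix names from the question and Airflow 1.10 contrib imports; the task id and function name are illustrative:

from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.operators import python_operator

def materialize_all_tables():
    # Everything runs at task execution time with the worker's service
    # account, so the web UI never needs BigQuery access to parse the DAG.
    tables = get_tables()
    verify_loaded(tables)

    bq_hook = BigQueryHook(bigquery_conn_id="bigquery_default", use_legacy_sql=False)
    cursor = bq_hook.get_conn().cursor()
    for vname in tables:
        cursor.run_query(
            sql="SELECT * FROM `asd.temp.{table}` LIMIT 5".format(table=vname),
            destination_dataset_table="asd.temp." + table_prefix + vname,
            write_disposition="WRITE_TRUNCATE",
            create_disposition="CREATE_IF_NEEDED",
            use_legacy_sql=False,
            allow_large_results=True,
        )

materialize_all = python_operator.PythonOperator(
    task_id="materialize_all",
    python_callable=materialize_all_tables,
)

Attach the operator to your DAG the same way as the original tasks.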
Source: https://stackoverflow.com/questions/57707896/problems-in-making-database-requests-in-airflow