Problems in making database requests in airflow

Submitted by 一笑奈何 on 2019-12-13 03:31:02

Question


I am trying to create tasks dynamically based on the response of a database call. But when I do this, the run option just doesn't appear in Airflow, so I can't run the DAG.

Here is the code:

tables = ['a', 'b', 'c']  # This works
#tables = get_tables()  # This never works

check_x = python_operator.PythonOperator(
    task_id="verify_loaded",
    python_callable=lambda: verify_loaded(tables)
)
bridge = DummyOperator(
    task_id='bridge'
)

check_x >> bridge

for vname in tables:
    sql = ("SELECT * FROM `asd.temp.{table}` LIMIT 5".format(table= vname ))

    log.info(vname)
    materialize__bq = BigQueryOperator(
        sql=sql,
        destination_dataset_table="asd.temp." + table_prefix + vname,
        task_id="materialize_" + vname,
        bigquery_conn_id="bigquery_default",
        google_cloud_storage_conn_id="google_cloud_default",
        use_legacy_sql=False,
        write_disposition="WRITE_TRUNCATE",
        create_disposition="CREATE_IF_NEEDED",
        query_params={},
        allow_large_results=True
    )

    bridge >> materialize__bq


def get_tables():
    bq_hook = BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=False)
    my_query = ("SELECT table_id FROM `{project}.{dataset}.{table}` LIMIT 3;".format(
        project=project, dataset=dataset, table='__TABLES__'))

    df = bq_hook.get_pandas_df(sql=my_query, dialect='standard')
    return df['table_id'].tolist()  # return the fetched table names

I am trying to make the commented-out part work, but with no luck. The get_tables() function fetches table names from BigQuery, and I wanted to generate the tasks dynamically this way. When I do this, I don't get the option to run and the DAG appears broken. Any help? I have been trying for a long time.

Here is a screenshot:


Answer 1:


To understand the problem, we must look at the Cloud Composer architecture:

https://cloud.google.com/composer/docs/concepts/overview

The scheduler runs in GKE using the service account you configured when you created the Composer instance.

The web UI runs in a tenant project on App Engine using a different service account. The resources of this tenant project are hidden (you don't see the App Engine application, the Cloud SQL instance, or the service account among the project resources).

When the web UI parses the DAG file, it tries to access BigQuery using the connection 'bigquery_default'. Check the Airflow GCP hook's _get_credentials source code:

https://github.com/apache/airflow/blob/1.10.2/airflow/contrib/hooks/gcp_api_base_hook.py#L74

If you have not configured the connection in the Airflow admin, it falls back to the google.auth.default method and connects to BigQuery with the tenant project's service account. That service account does not have permission to access BigQuery, so the call gets an unauthorized error and the UI cannot generate the DAG. If you check Stackdriver, you will probably find the BigQuery error.
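As a simplified paraphrase of that fallback behaviour (a sketch, not the exact hook source; see the link above for the real implementation):

```python
# Rough sketch of the credential selection in GoogleCloudBaseHook._get_credentials.
import google.auth
from google.oauth2 import service_account

def _get_credentials(key_path=None, keyfile_dict=None, scopes=None):
    if key_path:
        # Connection has a "Keyfile Path" configured in the Airflow admin.
        return service_account.Credentials.from_service_account_file(
            key_path, scopes=scopes)
    if keyfile_dict:
        # Connection has a "Keyfile JSON" configured in the Airflow admin.
        return service_account.Credentials.from_service_account_info(
            keyfile_dict, scopes=scopes)
    # Nothing configured: fall back to Application Default Credentials,
    # which in the Composer web UI is the tenant project's service account.
    credentials, _ = google.auth.default(scopes=scopes)
    return credentials
```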

On the other hand, the Airflow scheduler uses the service account specified at Composer creation, which has the right permissions, so it parses the DAG correctly.

If you execute the code in a local Airflow instance, where the web UI and the scheduler use the same service account, it works as expected in both cases.

The easiest solution is to add a Keyfile Path or Keyfile JSON to the bigquery_default connection, so the web UI stops using the default service account.
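For example, here is a minimal sketch of setting a keyfile path on the existing bigquery_default connection through the metadata database (the key path and project id are placeholders; you can also do the same thing in the Admin > Connections UI):

```python
# Sketch: point bigquery_default at an explicit service-account key file.
import json

from airflow import settings
from airflow.models import Connection

session = settings.Session()
conn = session.query(Connection).filter(Connection.conn_id == 'bigquery_default').one()
conn.extra = json.dumps({
    "extra__google_cloud_platform__key_path": "/path/to/service-account.json",  # placeholder
    "extra__google_cloud_platform__project": "your-project-id",                 # placeholder
    "extra__google_cloud_platform__scope": "https://www.googleapis.com/auth/cloud-platform",
})
session.add(conn)
session.commit()
```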

If you have security concerns with this solution (the service account credentials will be available to anyone with access to Composer), another option is to restructure the code so that everything runs inside a PythonOperator. That PythonOperator would call get_tables and then loop over the results, executing the BigQuery commands with a BigQueryHook instead of a BigQueryOperator. The downside of this approach is that you end up with a single task instead of a task per table. A rough sketch follows.
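A rough sketch of that restructuring (not the answerer's exact code), reusing get_tables() and table_prefix from the question and the run_query method on the hook's cursor; verify the parameter names against your Airflow version:

```python
# Sketch: run everything inside one PythonOperator so the BigQuery call
# happens at task run time, not at DAG parse time.
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.operators.python_operator import PythonOperator

def materialize_all_tables():
    hook = BigQueryHook(bigquery_conn_id='bigquery_default', use_legacy_sql=False)
    cursor = hook.get_conn().cursor()
    for vname in get_tables():  # same helper as in the question, now called at run time
        cursor.run_query(
            sql="SELECT * FROM `asd.temp.{table}` LIMIT 5".format(table=vname),
            destination_dataset_table="asd.temp." + table_prefix + vname,
            write_disposition="WRITE_TRUNCATE",
            create_disposition="CREATE_IF_NEEDED",
            allow_large_results=True,
            use_legacy_sql=False,
        )

materialize_all = PythonOperator(
    task_id="materialize_all_tables",
    python_callable=materialize_all_tables,
)
```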



Source: https://stackoverflow.com/questions/57707896/problems-in-making-database-requests-in-airflow
