Airflow unable to iterate through xcom_pull list with Google Cloud Operators


Question


I would like to dynamically get the list of CSV files in a GCS bucket and then load each one into a corresponding BQ table.

I am using the GoogleCloudStorageListOperator and GoogleCloudStorageToBigQueryOperator operators.

GCS_Files = GoogleCloudStorageListOperator(
                task_id='GCS_Files',
                bucket=cf.storage.import_bucket_name,
                prefix='20190701/',
                delimiter='.csv',
                dag=dag
            )

for idx, elem in enumerate(["{{ task_instance.xcom_pull(task_ids='GCS_Files') }}"]):
    storage_to_bigquery = GoogleCloudStorageToBigQueryOperator(
            task_id='storage_to_bigquery',
            bucket=cf.storage.import_bucket_name,
            create_disposition='CREATE_IF_NEEDED',
            autodetect=True,
            destination_project_dataset_table=f"{cf.project}.{cf.bigquery.core_dataset_name}.{idx}",
            skip_leading_rows=1,
            source_format='CSV', 
            source_objects=[f'{elem}'],
            write_disposition='WRITE_TRUNCATE',
            dag=dag
            )

    storage_to_bigquery.set_upstream(GCS_Files)

However, the list is not iterated element by element; the whole rendered list ends up being passed as a single source URI, which raises the error below.

googleapiclient.errors.HttpError: <HttpError 400 when requesting https://bigquery.googleapis.com/bigquery/v2/projects/my-project/jobs?alt=json returned "Source URI must not contain the ',' character: gs://mybucket/['20190701/file0.csv', '20190701/file1.csv', '20190701/file2.csv']">

Any suggestions? Thanks in advance.


Answer 1:


You cannot call a macro from just anywhere in your code. At parse time, "{{ task_instance.xcom_pull(task_ids='GCS_Files') }}" is just a string; it is only evaluated by Jinja2 when it is passed to the GCP operator through a templated field: https://github.com/apache/airflow/blob/21a7e7ec67ac7a391d837aa7c13c0825683f1349/airflow/contrib/operators/gcs_to_bq.py#L140

To be able to call task_instance.xcom_pull, you need a task context, which only exists during a DAG run. That context is not available when Airflow lazily parses the DAG file, which is when your for loop runs.
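
To illustrate that point, here is a minimal sketch (not from the original answer) that pulls the same XCom inside a PythonOperator callable, where the runtime context does exist. The task_id 'show_files', the callable name, and the use of Airflow 1.x-style provide_context are assumptions for the example.

from airflow.operators.python_operator import PythonOperator

def show_gcs_files(**context):
    # At run time a task instance exists, so xcom_pull returns the real Python list.
    files = context['ti'].xcom_pull(task_ids='GCS_Files')
    for gcs_object in files:
        print(gcs_object)

show_files = PythonOperator(
    task_id='show_files',
    python_callable=show_gcs_files,
    provide_context=True,  # Airflow 1.x: inject the runtime context into the callable
    dag=dag,
)
show_files.set_upstream(GCS_Files)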

In your case, the best option would be to use a SubDAG to loop over your operator, using your macro to generate the list of files to loop over: https://airflow.apache.org/concepts.html#subdags
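
A minimal structural sketch of that SubDAG pattern follows, assuming the file list is available as a plain Python list (file_list) when the DAG file is parsed; the factory name load_csvs_subdag and the per-file task_id naming are illustrative, not from the answer, and cf, dag and GCS_Files are the objects defined in the question.

from airflow import DAG
from airflow.operators.subdag_operator import SubDagOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

def load_csvs_subdag(parent_dag_name, child_dag_name, default_args, file_list):
    # Build a sub-DAG with one load task per CSV file.
    # file_list must be known at parse time, since the sub-DAG structure is built here.
    subdag = DAG(
        dag_id=f'{parent_dag_name}.{child_dag_name}',
        default_args=default_args,
        schedule_interval=None,
    )
    for idx, elem in enumerate(file_list):
        GoogleCloudStorageToBigQueryOperator(
            task_id=f'storage_to_bigquery_{idx}',  # unique task_id per file
            bucket=cf.storage.import_bucket_name,
            source_objects=[elem],  # one real object path, not a rendered list
            destination_project_dataset_table=f'{cf.project}.{cf.bigquery.core_dataset_name}.table_{idx}',
            source_format='CSV',
            skip_leading_rows=1,
            autodetect=True,
            create_disposition='CREATE_IF_NEEDED',
            write_disposition='WRITE_TRUNCATE',
            dag=subdag,
        )
    return subdag

load_csvs = SubDagOperator(
    task_id='load_csvs',
    subdag=load_csvs_subdag(dag.dag_id, 'load_csvs', dag.default_args, file_list),
    dag=dag,
)
load_csvs.set_upstream(GCS_Files)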



Source: https://stackoverflow.com/questions/56888744/airflow-unable-to-iterate-through-xcom-pull-list-with-googlecloud-operatos
