Setting up S3 for logs in Airflow

后悔当初 2020-11-27 05:50

I am using docker-compose to set up a scalable Airflow cluster. I based my approach on this Dockerfile: https://hub.docker.com/r/puckel/docker-airflow/

My problem is getting the logs to write to and read from S3.

7 Answers
  •  没有蜡笔的小新
    2020-11-27 06:34

    (Updated as of Airflow 1.10.2)

    Here's a solution if you don't use the admin UI.

    My Airflow doesn't run on a persistent server ... (It gets launched afresh every day in a Docker container, on Heroku.) I know I'm missing out on a lot of great features, but in my minimal setup, I never touch the admin UI or the cfg file. Instead, I set Airflow-specific environment variables in a bash script, which override the .cfg file.

    apache-airflow[s3]

    First of all, you need the s3 subpackage installed to write your Airflow logs to S3. (boto3 works fine for the Python jobs within your DAGs, but the S3Hook depends on the s3 subpackage.)

    One more side note: conda install doesn't handle this yet, so I have to do pip install apache-airflow[s3].
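
    If you want a quick way to confirm that the subpackage actually landed in your environment, importing the hook is enough, since the import fails when the s3 extra (and its boto dependency) is missing. A minimal sketch, using the Airflow 1.x import path; s3_uri is the connection ID defined further down:

    # Sanity check: this import only succeeds once apache-airflow[s3] is installed.
    from airflow.hooks.S3_hook import S3Hook

    # 's3_uri' is the connection ID configured below via AIRFLOW_CONN_S3_URI;
    # the connection is only looked up when the hook is actually used.
    hook = S3Hook(aws_conn_id="s3_uri")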

    Environment variables

    In a bash script, I set these core variables. Starting from these instructions but using the naming convention AIRFLOW__{SECTION}__{KEY} for environment variables, I do:

    export AIRFLOW__CORE__REMOTE_LOGGING=True
    export AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://bucket/key
    export AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_uri
    export AIRFLOW__CORE__ENCRYPT_S3_LOGS=False
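
    If you want to double-check that these variables actually win over whatever is in airflow.cfg, you can inspect the effective configuration from Python. Just a sketch using Airflow's own configuration object:

    # The AIRFLOW__CORE__* environment variables should take precedence over
    # the corresponding keys in airflow.cfg; print the effective values to verify.
    from airflow.configuration import conf

    print(conf.getboolean("core", "remote_logging"))   # expect True
    print(conf.get("core", "remote_base_log_folder"))  # expect s3://bucket/key
    print(conf.get("core", "remote_log_conn_id"))      # expect s3_uri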
    

    S3 connection ID

    The s3_uri above is a connection ID that I made up. In Airflow, it corresponds to another environment variable, AIRFLOW_CONN_S3_URI. The value of that is your S3 path, which has to be in URI form. That's

    s3://access_key:secret_key@bucket/key
    

    Store this however you handle other sensitive environment variables.
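
    One wrinkle: because Airflow parses AIRFLOW_CONN_S3_URI as a URI, a secret key containing characters such as / or + has to be percent-encoded first. A small sketch that prints a value you could paste into the bash script (the credentials below are placeholders):

    from urllib.parse import quote

    # Placeholder credentials; in practice pull these from wherever you keep
    # your other secrets.
    access_key = "my-access-key"
    secret_key = "my/secret+key"

    # Percent-encode both pieces so special characters don't break URI parsing.
    print("s3://{}:{}@bucket/key".format(quote(access_key, safe=""),
                                         quote(secret_key, safe="")))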

    With this configuration, Airflow will be able to write your logs to S3. They will be written under paths of the form s3://bucket/key/dag/task_id/timestamp/1.log.
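
    To confirm that logs are really landing there after a task run, you can list the prefix through the same hook. A sketch, reusing the bucket/key placeholders from above:

    from airflow.hooks.S3_hook import S3Hook

    # Reuses the connection defined via AIRFLOW_CONN_S3_URI.
    hook = S3Hook(aws_conn_id="s3_uri")

    # list_keys returns None when nothing matches, hence the `or []`.
    for log_key in hook.list_keys(bucket_name="bucket", prefix="key/") or []:
        print(log_key)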


    Appendix on upgrading from Airflow 1.8 to Airflow 1.10

    I recently upgraded my production pipeline from Airflow 1.8 to 1.9, and then to 1.10. The good news is that the changes were pretty tiny; the rest of the work was just figuring out nuances of the package installations (unrelated to the original question about S3 logs).

    (1) First of all, I needed to upgrade to Python 3.6 with Airflow 1.9.

    (2) The package name changed from airflow to apache-airflow with 1.9. You also might run into this in your pip install.

    (3) The package psutil has to be in a specific version range for Airflow. You might encounter this when you're doing pip install apache-airflow.

    (4) python3-dev headers are needed with Airflow 1.9+.

    (5) Here are the substantive changes: export AIRFLOW__CORE__REMOTE_LOGGING=True is now required. And

    (6) The logs have a slightly different path in S3, which I updated in the answer: s3://bucket/key/dag/task_id/timestamp/1.log.

    But that's it! The logs did not work in 1.9, so I recommend just going straight to 1.10, now that it's available.
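
    If you're unsure which combination you ended up on after the upgrade, a trivial check from the same environment settles it (just a sketch):

    import sys
    import airflow

    # Confirms the interpreter and Airflow versions the scheduler/webserver will use.
    print(sys.version)
    print(airflow.__version__)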
