Setting up S3 for logs in Airflow


Question:

I am using docker-compose to set up a scalable Airflow cluster. I based my approach on this Dockerfile: https://hub.docker.com/r/puckel/docker-airflow/

My problem is getting the logs set up to write to and read from S3. When a DAG has completed, I get an error like this:

*** Log file isn't local.
*** Fetching here: http://ea43d4d49f35:8793/log/xxxxxxx/2017-06-26T11:00:00
*** Failed to fetch log file from worker.
*** Reading remote logs...
Could not read logs from s3://buckets/xxxxxxx/airflow/logs/xxxxxxx/2017-06-26T11:00:00

I set up a new section in the airflow.cfg file like this

[MyS3Conn]
aws_access_key_id = xxxxxxx
aws_secret_access_key = xxxxxxx
aws_default_region = xxxxxxx

And then specified the S3 path in the remote log section of airflow.cfg:

remote_base_log_folder = s3://buckets/xxxx/airflow/logs
remote_log_conn_id = MyS3Conn

Did I set this up properly and is there a bug? Is there a recipe for success here that I am missing?

-- Update

I tried exporting the connection in URI and JSON formats and neither seemed to work. I then exported aws_access_key_id and aws_secret_access_key, and Airflow started picking them up. Now I get this error in the worker logs:

6/30/2017 6:05:59 PM INFO:root:Using connection to: s3
6/30/2017 6:06:00 PM ERROR:root:Could not read logs from s3://buckets/xxxxxx/airflow/logs/xxxxx/2017-06-30T23:45:00
6/30/2017 6:06:00 PM ERROR:root:Could not write logs to s3://buckets/xxxxxx/airflow/logs/xxxxx/2017-06-30T23:45:00
6/30/2017 6:06:00 PM Logging into: /usr/local/airflow/logs/xxxxx/2017-06-30T23:45:00

-- Update

I found this link as well https://www.mail-archive.com/dev@airflow.incubator.apache.org/msg00462.html

I then shelled into one of my worker machines (separate from the webserver and scheduler) and ran this bit of code in Python:

import airflow

s3 = airflow.hooks.S3Hook('s3_conn')
s3.load_string('test', airflow.conf.get('core', 'remote_base_log_folder'))

I receive this error.

boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden 

I tried exporting several different formats of AIRFLOW_CONN_ environment variables, as explained in the connections section here https://airflow.incubator.apache.org/concepts.html and in other answers to this question.

s3://:@S3
{"aws_account_id":"","role_arn":"arn:aws:iam:::role/"}
{"aws_access_key_id":"","aws_secret_access_key":""}

I have also exported AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY with no success.

These credentials are stored in a database, so once I add them in the UI they should be picked up by the workers, but for some reason the workers are not able to write/read logs.

Answer 1:

You need to set up the S3 connection through the Airflow UI. To do this, go to the Admin -> Connections tab in the Airflow UI and create a new row for your S3 connection.

An example configuration would be:

Conn Id: my_conn_S3

Conn Type: S3

Extra: {"aws_access_key_id":"your_aws_key_id", "aws_secret_access_key": "your_aws_secret_key"}
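
If you want to sanity-check this connection from a worker before relying on remote logging, a minimal sketch that mirrors the snippet in the question can help. The Conn Id matches the example above; the bucket and key names are placeholders, and the import style is the 1.8-era one used in the question (in 1.9+ the hook lives at airflow.hooks.S3_hook.S3Hook):

# Minimal check that the 'my_conn_S3' connection can write to your bucket.
# Run from a worker shell; bucket and key names below are placeholders.
import airflow

s3 = airflow.hooks.S3Hook('my_conn_S3')
s3.load_string('connection works',
               key='airflow-connection-test.txt',
               bucket_name='your-log-bucket',
               replace=True)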



Answer 2:

NOTE: As of Airflow 1.9.0 remote logging has been significantly altered.

For 1.9+, follow steps 1-9 below to get remote logs, replacing gcs with s3.

EDIT -- Complete Instructions:

  1. Create a directory to store configs and place it somewhere it can be found on PYTHONPATH. One example is $AIRFLOW_HOME/config.

  2. Create empty files called $AIRFLOW_HOME/config/log_config.py and $AIRFLOW_HOME/config/__init__.py

  3. Copy the contents of airflow/config_templates/airflow_local_settings.py into the log_config.py file that was just created in the step above.

  4. Customize the following portions of the template (a consolidated sketch of the resulting file appears after this list of steps):

    # Add this variable to the top of the file. Note the trailing slash.
    S3_LOG_FOLDER = 's3:///'

    # Rename DEFAULT_LOGGING_CONFIG to LOGGING_CONFIG
    LOGGING_CONFIG = ...

    # Add an S3TaskHandler to the 'handlers' block of the LOGGING_CONFIG variable
    's3.task': {
        'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
        'formatter': 'airflow.task',
        'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
        's3_log_folder': S3_LOG_FOLDER,
        'filename_template': FILENAME_TEMPLATE,
    },

    # Update the airflow.task and airflow.task_runner blocks to use 's3.task' instead of 'file.task'.
    'loggers': {
        'airflow.task': {
            'handlers': ['s3.task'],
            ...
        },
        'airflow.task_runner': {
            'handlers': ['s3.task'],
            ...
        },
        'airflow': {
            'handlers': ['console'],
            ...
        },
    }
  5. Make sure an S3 connection hook has been defined in Airflow, as per the answer above. The hook should have read and write access to the S3 bucket defined in S3_LOG_FOLDER.

  6. Update $AIRFLOW_HOME/airflow.cfg to contain:

    task_log_reader = s3.task
    logging_config_class = log_config.LOGGING_CONFIG
    remote_log_conn_id =
  7. Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.

  8. Verify that logs are showing up for newly executed tasks in the bucket you’ve defined.

  9. Verify that the s3 storage viewer is working in the UI. Pull up a newly executed task, and verify that you see something like:

    *** Reading remote log from gs:///example_bash_operator/run_this_last/2017-10-03T00:00:00/16.log.
    [2017-10-03 21:57:50,056] {cli.py:377} INFO - Running on host chrisr-00532
    [2017-10-03 21:57:50,093] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run example_bash_operator run_this_last 2017-10-03T00:00:00 --job_id 47 --raw -sd DAGS_FOLDER/example_dags/example_bash_operator.py']
    [2017-10-03 21:57:51,264] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,263] {__init__.py:45} INFO - Using executor SequentialExecutor
    [2017-10-03 21:57:51,306] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,306] {models.py:186} INFO - Filling up the DagBag from /airflow/dags/example_dags/example_bash_operator.py
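
For reference, here is a skeletal sketch of how the customized log_config.py from steps 3-4 can end up looking once the pieces are combined. It assumes you started from the Airflow 1.9 airflow_local_settings.py template, shows only the S3-related parts (the template's remaining handlers and loggers are marked with comments), and uses a placeholder bucket name:

# log_config.py -- skeletal sketch only; start from a full copy of
# airflow/config_templates/airflow_local_settings.py and apply the edits from step 4.
import os

from airflow import configuration as conf

LOG_LEVEL = conf.get('core', 'LOGGING_LEVEL').upper()
LOG_FORMAT = conf.get('core', 'log_format')
BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')
FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'

# Placeholder bucket; note the trailing slash.
S3_LOG_FOLDER = 's3://your-log-bucket/airflow/logs/'

LOGGING_CONFIG = {  # renamed from DEFAULT_LOGGING_CONFIG
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'airflow.task': {'format': LOG_FORMAT},
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'airflow.task',
        },
        's3.task': {
            'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
            'formatter': 'airflow.task',
            'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
            's3_log_folder': S3_LOG_FOLDER,
            'filename_template': FILENAME_TEMPLATE,
        },
        # ... keep the template's remaining handlers here ...
    },
    'loggers': {
        'airflow.task': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': False,
        },
        'airflow.task_runner': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': True,
        },
        'airflow': {
            'handlers': ['console'],
            'level': LOG_LEVEL,
        },
        # ... keep the template's remaining loggers here ...
    },
}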


Answer 3:

Here's a solution if you don't want to use the admin UI.

My deployment process is Dockerized, and I never touch the admin UI. I also like setting Airflow-specific environment variables in a bash script, which override the settings in the .cfg file.

airflow[s3]

First of all, you need the s3 subpackage installed to write your Airflow logs to S3. (boto3 works fine for the Python jobs within your DAGs, but the S3Hook depends on the s3 subpackage.)

One more side note: conda install doesn't handle this yet, so I have to do pip install airflow[s3].

Environment variables

In a bash script, I set these core variables. Starting from these instructions but using the naming convention AIRFLOW__{SECTION}__{KEY} for environment variables, I do:

export AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://bucket/key
export AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_uri
export AIRFLOW__CORE__ENCRYPT_S3_LOGS=False
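
To confirm these overrides are actually being read (environment variables take precedence over airflow.cfg), a quick check run from the same environment as the scheduler and workers:

# Assumes the AIRFLOW__CORE__* variables above are exported in this shell.
from airflow import configuration as conf

print(conf.get('core', 'remote_base_log_folder'))  # expect s3://bucket/key
print(conf.get('core', 'remote_log_conn_id'))      # expect s3_uri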

S3 connection ID

s3_uri is a connection ID that I made up. In Airflow, it corresponds to another environment variable, AIRFLOW_CONN_S3_URI. The value of that is your S3 path, which has to be in URI form. That's

s3://access_key:secret_key@bucket/key 

Store this however you handle other sensitive environment variables.
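
As an optional sanity check, you can confirm that the URI-form value parses the way Airflow will read it. This sketch only exercises Connection parsing; the credentials are placeholders, and any special characters in the real secret key must be URL-encoded for the URI form:

# Placeholders only -- do not paste real credentials into code.
from airflow.models import Connection

c = Connection(conn_id='s3_uri', uri='s3://access_key:secret_key@bucket/key')
print(c.conn_type)  # s3
print(c.login)      # access_key
print(c.host)       # bucket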

With this configuration, Airflow will write your logs to S3. They will follow the path of s3://bucket/key/dag/task_id.


