Wait until AWS Glue crawler has finished running

ぐ巨炮叔叔 提交于 2021-02-19 05:30:17

问题


In the documentation, I cannot find any way of checking the run status of a crawler. The only way I am doing it currently is constantly checking AWS to check if the file/table has been created.

Is there a better way to block until crawler finishes its run?


回答1:


You can use boto3 (or similar) to do it. There is the get_crawler method. You will find needed information in "LastCrawl" section

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_crawler




回答2:


The following function uses boto3. It starts the AWS Glue crawler and waits until its completion. It also logs the status as it progresses. It was tested with Python v3.8 with boto3 v1.17.3.

import logging
import time
import timeit

import boto3

log = logging.getLogger(__name__)


def glue_crawl(crawler: str, *, timeout_minutes: int = 60, retry_seconds: int = 5) -> None:
    """Crawl the specified AWS Glue crawler, waiting until completion."""
    # Ref: https://stackoverflow.com/a/66072347/
    timeout_seconds = timeout_minutes * 60
    client = boto3.client("glue")
    start_time = timeit.default_timer()
    abort_time = start_time + timeout_seconds

    def check_for_timeout() -> None:
        if timeit.default_timer() > abort_time:
            raise TimeoutError(f"Failed to crawl {crawler}. The allocated time of {timeout_minutes:,} minutes has elapsed.")

    def wait_until_ready() -> None:
        state_previous = None
        while True:
            response_get = client.get_crawler(Name=crawler)
            state = response_get["Crawler"]["State"]
            if state != state_previous:
                log.info(f"Crawler {crawler} is {state.lower()}.")
                state_previous = state
            if state == "READY":  # Other known states: RUNNING, STOPPING
                return
            check_for_timeout()
            time.sleep(retry_seconds)

    wait_until_ready()
    response_start = client.start_crawler(Name=crawler)
    assert response_start["ResponseMetadata"]["HTTPStatusCode"] == 200
    log.info(f"Crawling {crawler}.")
    wait_until_ready()
    log.info(f"Crawled {crawler}.")


来源:https://stackoverflow.com/questions/52996591/wait-until-aws-glue-crawler-has-finished-running

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!