Google App Engine deployment fails because of failing readiness check

早过忘川 提交于 2020-07-09 11:51:14

问题


A custom app engine environment fails to start up and it seems to be due to failing health checks. The app has a few custom dependencies (e.g. PostGIS, GDAL) so a few layers on top of the app engine image. It builds successfully and it runs locally in a Docker container.

ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.

The Dockerfile looks as follows (Note: no CMD as entrypoint is defined in docker-compose.yml and app.yaml):

FROM gcr.io/google-appengine/python
ENV PYTHONUNBUFFERED 1
ENV DEBIAN_FRONTEND noninteractive

RUN apt -y update && apt -y upgrade\
    && apt-get install -y software-properties-common \
    && add-apt-repository -y ppa:ubuntugis/ppa \
    && apt -y update \
    && apt-get -y install gdal-bin libgdal-dev python3-gdal  \ 
    && apt-get autoremove -y \
    && apt-get autoclean -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

ADD requirements.txt /app/requirements.txt
RUN python3 -m pip install -r /app/requirements.txt 
ADD . /app/
WORKDIR /app

This unfortunately creates an image of a whopping 1.58GB, but the original gcr.io python image starts at 1.05GB, so I don't think the size of the image would or should be a problem.

Running this locally with the following docker-compose.yml config beautifully spins up a container in no time:

version: "3"
services:
  web:
    build: .
    command: gunicorn gisapplication.wsgi --bind 0.0.0.0:8080

So, I would have expected the following yaml.app would do the trick:

runtime: custom
env: flex
entrypoint: gunicorn -b :$PORT gisapplication.wsgi

beta_settings:
    cloud_sql_instances: <sql-db-connection>

runtime_config:
    python_version: 3

No luck. So, as per error above, it seemed to have something to do with the readiness check. Tried increasing the timeout for the app to start (15 mins!) There seemed to have been some issues with health checks previously and rolling back to legacy health checks is not a solution as of Sept 2019.

readiness_check:
    path: "/readiness_check"
    check_interval_sec: 10
    timeout_sec: 10
    failure_threshold: 3
    success_threshold: 3
    app_start_timeout_sec: 900

liveness_check:
    path: "/liveness_check"
    check_interval_sec: 60
    timeout_sec: 4
    failure_threshold: 3
    success_threshold: 2
    initial_delay_sec: 30

Split health checks are definitely on. The output from gcloud beta app describe is:

authDomain: gmail.com
codeBucket: staging.proj-id-000000.appspot.com
databaseType: CLOUD_DATASTORE_COMPATIBILITY
defaultBucket: proj-id-000000.appspot.com
defaultHostname: proj-id-000000.ts.r.appspot.com
featureSettings:
  splitHealthChecks: true
  useContainerOptimizedOs: true
gcrDomain: asia.gcr.io
id: proj-id-000000
locationId: australia-southeast1
name: apps/proj-id-000000
servingStatus: SERVING

That didn't work, so also tried to increase the resources available to the instance and allocated the maximum amount of memory for 1 CPU (6.1GB):

resources:
    cpu: 1
    memory_gb: 6.1
    disk_size_gb: 10

Just to be on the safe side, I added health check endpoints to the app (legacy health checks and the split health checks) - it's a Django app, so this went into the project's urls.py:

path(r'_ah/health/', lambda r: HttpResponse("OK", status=200)),
path(r'readiness_check/', lambda r: HttpResponse("OK", status=200)),
path(r'liveness_check/', lambda r: HttpResponse("OK", status=200)),

So, when I dive into the logs, there seems to be a successful request to /liveness_check from a curl user agent, but the subsequent requests to /readiness_check from GoogleHC agent return a 503 (Service Unavailable)

Shortly after (after 8 failed requests - why 8?) a shutdown trigger seems to be sent of:

2020-07-05 09:00:02.603 AEST Triggering app shutdown handlers.

Any ideas of what is going on here? I think I've pretty much exhausted the options to fix this problem and wonder whether the time wouldn't have been better invested in getting things up and running in Compute/EC2.

ADDENDUM:

in addition to the SO issue linked, I've gone through issues on Google (here and here)


回答1:


You are sending the readiness check to path: "/readiness_check", but your url handler for that is path(r'readiness_check/'...)

Note trailing slash in the handler. Remove that (or add a trailing slash in the path for readiness_check:) and see if that fixes it. I would think that would give you a 404, but you are getting a 503 which tells me that you may have a more serious error. Click one of the arrows at the left of a 503 in the console, and see what the error message is. You may need to search in the console for traceback to see it.



来源:https://stackoverflow.com/questions/62735687/google-app-engine-deployment-fails-because-of-failing-readiness-check

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!