How to make GCE instance stop when its deployed container finishes?

Submitted by 三世轮回 on 2019-11-29 14:22:05

Question


I have a Docker container that performs a single large computation. This computation requires lots of memory and takes about 12 hours to run.

I can create a Google Compute Engine VM of the appropriate size and use the "Deploy a container image to this VM instance" option to run this job perfectly. However, once the job finishes, the container exits but the VM keeps running (and accruing charges).

How can I make the VM exit/stop/delete when the container exits?

When the VM is in this zombie state, only the Stackdriver containers are left running:

$ docker ps
CONTAINER ID        IMAGE                                                                COMMAND                  CREATED             STATUS              PORTS               NAMES
bfa2feb03180        gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1   "/entrypoint.sh /u..."   17 hours ago        Up 17 hours                             stackdriver-logging-agent
161439a487c2        gcr.io/stackdriver-agents/stackdriver-metadata-agent:0.2-0.0.17-2    "/bin/sh -c /opt/s..."   17 hours ago        Up 17 hours         8000/tcp            stackdriver-metadata-agent

I create the VM like this:

gcloud beta compute --project=abc instances create-with-container vm-name \
                    --zone=us-central1-c --machine-type=custom-1-65536-ext \
                    --network=default --network-tier=PREMIUM --metadata=google-logging-enabled=true \
                    --maintenance-policy=MIGRATE \
                    --service-account=xyz \
                    --scopes=https://www.googleapis.com/auth/cloud-platform \
                    --image=cos-stable-69-10895-71-0 --image-project=cos-cloud --boot-disk-size=10GB \
                    --boot-disk-type=pd-standard --boot-disk-device-name=vm-name \
                    --container-image=gcr.io/abc/my-image --container-restart-policy=on-failure \
                    --container-command=python3 \
                    --container-arg="a" --container-arg="b" --container-arg="c" \
                    --labels=container-vm=cos-stable-69-10895-71-0

Answer 1:


When you create the VM, give it write access to Compute Engine (the compute scope) so that the instance can be deleted from within. Also set container environment variables such as gce_zone and gce_project_id at this point; you'll need them to delete the instance.

gcloud beta compute instances create-with-container {NAME} \
    --container-env=gce_zone={ZONE},gce_project_id={PROJECT_ID} \
    --service-account={SERVICE_ACCOUNT} \
    --scopes=https://www.googleapis.com/auth/compute,...
    ...

Then within the container, whenever YOU determine your task is finished:

  1. Request an API token (I'm using curl for simplicity, and the DEFAULT GCE service account):
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google"

This responds with JSON that looks like:

{
  "access_token": "foobarbaz...",
  "expires_in": 1234,
  "token_type": "Bearer"
}
  2. Take that access token and call the instances.delete API endpoint (note the environment variables):
curl -XDELETE -H 'Authorization: Bearer {TOKEN}' https://www.googleapis.com/compute/v1/projects/$gce_project_id/zones/$gce_zone/instances/$HOSTNAME
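
The two steps above can also be sketched in Python with only the standard library. This is a minimal sketch, not the answer's own code: the helper names parse_access_token and build_instance_request are hypothetical, and the "stop" action is an assumption added here because the question also asks about stopping the VM (it targets the instances.stop endpoint, which keeps the instance and its disk instead of deleting them). The request is only built, not sent.

```python
import json
import urllib.request

# Same API host as the curl example above.
API_ROOT = "https://www.googleapis.com/compute/v1"


def parse_access_token(token_json):
    """Extract the bearer token from the metadata server's JSON reply."""
    return json.loads(token_json)["access_token"]


def build_instance_request(token, project_id, zone, name, action="delete"):
    """Build (but do not send) the Compute Engine call for one instance.

    action="delete" mirrors the curl example; action="stop" is an assumed
    alternative that calls instances.stop instead.
    """
    url = "{}/projects/{}/zones/{}/instances/{}".format(
        API_ROOT, project_id, zone, name)
    if action == "delete":
        method, data = "DELETE", None
    elif action == "stop":
        url += "/stop"
        # Empty body so that a Content-Length header is sent with the POST.
        method, data = "POST", b""
    else:
        raise ValueError("unknown action: {}".format(action))
    return urllib.request.Request(
        url, data=data, method=method,
        headers={"Authorization": "Bearer " + token})
```

Inside the container, project_id and zone would come from os.environ["gce_project_id"] and os.environ["gce_zone"] as set at VM creation, and passing the returned object to urllib.request.urlopen would send the call.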



Answer 2:


I wrote a self-contained Python function based on Vincent's answer.

def kill_vm():
    """
    If we are running inside a GCE VM, kill it.
    """
    # based on https://stackoverflow.com/q/52748332/321772
    import json
    import logging
    import requests

    # get the token
    r = json.loads(
        requests.get("http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
                     headers={"Metadata-Flavor": "Google"})
            .text)

    token = r["access_token"]

    # get instance metadata
    # based on https://cloud.google.com/compute/docs/storing-retrieving-metadata
    project_id = requests.get("http://metadata.google.internal/computeMetadata/v1/project/project-id",
                              headers={"Metadata-Flavor": "Google"}).text

    name = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/name",
                        headers={"Metadata-Flavor": "Google"}).text

    zone_long = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/zone",
                             headers={"Metadata-Flavor": "Google"}).text
    zone = zone_long.split("/")[-1]

    # shut ourselves down
    logging.info("Calling API to delete this VM, {zone}/{name}".format(zone=zone, name=name))

    requests.delete("https://www.googleapis.com/compute/v1/projects/{project_id}/zones/{zone}/instances/{name}"
                    .format(project_id=project_id, zone=zone, name=name),
                    headers={"Authorization": "Bearer {token}".format(token=token)})

A simple atexit hook gets me my desired behavior:

import atexit
atexit.register(kill_vm)



Answer 3:


Having grappled with the problem for some time, here's a full solution that works pretty well.

This solution doesn't use the "start machine with a container image" option. Instead it uses a startup script, which is more flexible. You still use a Container-Optimized OS instance.

  1. Create a startup script:
#!/usr/bin/env bash

# get image name and container parameters from the metadata
IMAGE_NAME=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/image_name -H "Metadata-Flavor: Google")

CONTAINER_PARAM=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/container_param -H "Metadata-Flavor: Google")

# This is needed if you are using private images in GCP Container Registry
# (possibly also for the gcp log driver?)
sudo HOME=/home/root /usr/bin/docker-credential-gcr configure-docker

# Run! The logs will go to Stackdriver
sudo HOME=/home/root  docker run --log-driver=gcplogs ${IMAGE_NAME} ${CONTAINER_PARAM}

# Get the zone (the metadata server returns a path ending in /zones/ZONE)
zoneMetadata=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# Keep only the text after the last / to get the actual zone name.
# This avoids changing IFS for the rest of the script.
ZONE="${zoneMetadata##*/}"

# Run compute delete on the current instance. Need to run in a container 
# because COS machines don't come with gcloud installed 
docker run --entrypoint "gcloud" google/cloud-sdk:alpine compute instances delete ${HOSTNAME}  --delete-disks=all --zone=${ZONE}
  2. Put the script somewhere public. For example, put it on Cloud Storage and create a public URL. You can't use a gs:// URI for a COS startup script.

  3. Start an instance using a startup-script-url, passing the image name and parameters, e.g.:

gcloud compute --project=PROJECT_NAME instances create INSTANCE_NAME  \
--zone=ZONE --machine-type=TYPE \
--metadata=image_name=IMAGE_NAME,\
container_param="PARAM1 PARAM2 PARAM3",\
startup-script-url=PUBLIC_SCRIPT_URL \
--maintenance-policy=MIGRATE --service-account=SERVICE_ACCOUNT \
--scopes=https://www.googleapis.com/auth/cloud-platform --image-family=cos-stable \
--image-project=cos-cloud --boot-disk-size=10GB --boot-disk-device-name=DISK_NAME

(You probably want to limit the scopes; the example uses full access for simplicity.)



Source: https://stackoverflow.com/questions/52748332/how-to-make-gce-instance-stop-when-its-deployed-container-finishes
