Why do my google cloud compute instances always unexpectedly restart?

Submitted by 人走茶凉 on 2019-12-21 10:52:45

Question


Help! Help! Help!

This is really annoying and I can hardly bear it anymore! I'm using Google Cloud Compute Engine instances, but they often restart unexpectedly without any advance notification. The restarts seem to happen at random and I have no idea what is going wrong. I'm pretty sure the instances are busy (CPU usage > 50% and all GPUs are in use) when a restart happens. Could anyone please tell me how to solve this problem? Thanks in advance!


Answer 1:


The issue is right here:

all GPUs are in use

The official documentation on GPUs says:

GPU instances must terminate for host maintenance events, but can automatically restart. These maintenance events typically occur once per week, but can occur more frequently when necessary. You must configure your workloads to handle these maintenance events cleanly. Specifically, long-running workloads like machine learning and high-performance computing (HPC) must handle the interruption of host maintenance events. Learn how to handle host maintenance events on instances with GPUs.

This is because an instance with a GPU attached cannot be live-migrated to another host for maintenance the way other virtual machines are. To give the instance a physical GPU and bare-metal performance, Compute Engine uses GPU passthrough, which sadly means that if the host has to go through maintenance, the VM goes down with it.
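One way to handle these events cleanly is to watch the instance metadata server from inside the VM and save state when a termination is announced. The following is a minimal sketch, not a definitive implementation: it assumes the `maintenance-event` metadata key with the values `NONE`, `MIGRATE_ON_HOST_MAINTENANCE` and `TERMINATE_ON_HOST_MAINTENANCE`, and `save_checkpoint` is a hypothetical placeholder for your own checkpoint logic.

```shell
#!/usr/bin/env bash
# Sketch: react to an upcoming host maintenance event from inside the VM.
# save_checkpoint is a hypothetical hook; replace its body with your own logic.

METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"

# Map a maintenance-event value to the action this sketch takes.
maintenance_action() {
  case "$1" in
    TERMINATE_ON_HOST_MAINTENANCE) echo "checkpoint" ;;  # the VM is about to be stopped
    MIGRATE_ON_HOST_MAINTENANCE)   echo "ignore" ;;      # live migration, no restart
    NONE)                          echo "ignore" ;;      # nothing scheduled
    *)                             echo "unknown" ;;
  esac
}

# Hypothetical placeholder for saving training or job state.
save_checkpoint() {
  echo "saving checkpoint at $(date -u +%FT%TZ)"
}

# Block until the metadata value changes, then act on the new value.
# Nothing calls this automatically; start it in the background yourself.
watch_maintenance_events() {
  while true; do
    event=$(curl -s -H "Metadata-Flavor: Google" \
      "${METADATA_URL}?wait_for_change=true") || { sleep 5; continue; }
    if [ "$(maintenance_action "$event")" = "checkpoint" ]; then
      save_checkpoint
    fi
  done
}
```

To use it, source the file in the job's start script and run `watch_maintenance_events &` before launching the long-running workload.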




Answer 2:


This sounds like a preemptible VM instance.

Preemptible instances function like normal instances, but have the following limitations:

  • Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.
  • Compute Engine always terminates preemptible instances after they run for 24 hours.

To check whether your instance is preemptible using the gcloud CLI, run:

gcloud compute instances describe instance-name --format="(scheduling.preemptible)"

Result

scheduling:
  preemptible: false

Replace "instance-name" with the real name of your instance.

Alternatively, check in the Cloud Console UI: click the compute instance and scroll down to its preemptibility setting.
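The same `describe` call can also report the host maintenance policy, which covers the GPU case from the first answer as well. This is only a sketch: the function names are made up for illustration, and it assumes the `scheduling.onHostMaintenance` field holds `MIGRATE` or `TERMINATE`.

```shell
#!/usr/bin/env bash
# Sketch: interpret an instance's scheduling policy.
# Arguments: $1 = scheduling.preemptible, $2 = scheduling.onHostMaintenance
explain_scheduling() {
  # Normalize case so "True", "true" and "TRUE" are treated alike.
  preemptible=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  on_maintenance=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
  if [ "$preemptible" = "true" ]; then
    echo "preemptible: may be terminated at any time"
  elif [ "$on_maintenance" = "TERMINATE" ]; then
    echo "terminates on host maintenance"
  else
    echo "live-migrates on host maintenance"
  fi
}

# Fetch both fields for one instance and explain them.
# Requires an authenticated gcloud; $1 is the instance name.
describe_scheduling() {
  gcloud compute instances describe "$1" \
    --format="csv[no-heading](scheduling.preemptible,scheduling.onHostMaintenance)" |
  while IFS=, read -r p m; do
    explain_scheduling "$p" "$m"
  done
}
```

For example, `describe_scheduling instance-name` prints one line saying whether the instance is preemptible, terminates on maintenance, or live-migrates.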

To review the system operations performed on your instance, use the following command:

gcloud compute operations list 
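The operation type in that listing usually tells the two explanations apart. The sketch below labels a few system operation types; the function names are hypothetical, and the list of types is an assumption that may not be complete.

```shell
#!/usr/bin/env bash
# Sketch: label Compute Engine system operation types that explain a restart.
restart_cause() {
  case "$1" in
    compute.instances.preempted)                  echo "preemptible VM was preempted" ;;
    compute.instances.terminateOnHostMaintenance) echo "stopped for host maintenance" ;;
    compute.instances.hostError)                  echo "host hardware or software error" ;;
    compute.instances.automaticRestart)           echo "restarted automatically" ;;
    *)                                            echo "not a known restart cause" ;;
  esac
}

# List operations and annotate each one (requires an authenticated gcloud).
list_restart_causes() {
  gcloud compute operations list \
    --format="value(operationType,insertTime,targetLink.basename())" |
  while read -r op_type when target; do
    echo "$when $target: $(restart_cause "$op_type")"
  done
}
```

Running `list_restart_causes` after an unexpected restart shows whether the instance was preempted or stopped for host maintenance at that time.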


Source: https://stackoverflow.com/questions/48475029/why-do-my-google-cloud-compute-instances-always-unexpectedly-restart
