Question
We have a GKE cluster with:
- master nodes with version 1.6.13-gke.0
- 2 node pools with version 1.6.11-gke.0
We have Stackdriver Monitoring and Logging activated.
On 2018-01-22, the masters were upgraded by Google to version 1.7.11-gke.1.
Since this upgrade, we have been seeing a lot of errors like these:
I 2018-01-25 11:35:23 +0000 [error]: Exception emitting record: No such file or directory @ sys_fail2 - (/var/log/fluentd-buffers/kubernetes.system.buffer..b5638802e3e04e72f.log, /var/log/fluentd-buffers/kubernetes.system.buffer..q5638802e3e04e72f.log)
I 2018-01-25 11:35:23 +0000 [warn]: emit transaction failed: error_class=Errno::ENOENT error="No such file or directory @ sys_fail2 - (/var/log/fluentd-buffers/kubernetes.system.buffer..b5638802e3e04e72f.log, /var/log/fluentd-buffers/kubernetes.system.buffer..q5638802e3e04e72f.log)" tag="docker"
I 2018-01-25 11:35:23 +0000 [warn]: suppressed same stacktrace
These messages are flooding our logs (~25 GB per day) and are generated by pods managed by a DaemonSet called fluentd-gcp-v2.0.9.
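A quick way to confirm which pods produce these errors (a sketch only; it assumes the GKE addon's usual k8s-app=fluentd-gcp label on those pods):
# list the DaemonSet, then tail a few lines from its pods to see the ENOENT errors
kubectl -n kube-system get daemonset fluentd-gcp-v2.0.9
kubectl -n kube-system logs -l k8s-app=fluentd-gcp --tail=20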
We found that this is a bug that was fixed in 1.8 and backported to 1.7.12.
My questions are:
- Should we upgrade the masters to version 1.7.12? Is it safe to do so? OR
- Is there any other alternative to test before upgrading?
Thanks in advance.
Answer 1:
First of all, the answer to question 2.
As alternatives we could have:
- filtered fluentd to ignore logs from the fluentd-gcp pods, OR
- deactivated Stackdriver Monitoring and Logging (a sketch follows below)
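For the second alternative, a minimal sketch using the gcloud SDK; my-cluster and europe-west1-b are placeholders, and gcloud container clusters update only accepts one of these flags per invocation:
# stop sending container logs to Stackdriver Logging
gcloud container clusters update my-cluster --zone europe-west1-b --logging-service none
# stop sending metrics to Stackdriver Monitoring
gcloud container clusters update my-cluster --zone europe-west1-b --monitoring-service none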
To answer question 1:
We upgraded to 1.7.12 in a test environment. The process took 3 minutes; during that time we could not edit the cluster or access it with kubectl (as expected).
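In case it helps, this is roughly how the master upgrade can be triggered from the CLI (a sketch only; my-cluster and europe-west1-b are placeholders, and the exact 1.7.12 patch release must be one that get-server-config lists for your zone):
# check which master versions are available in the zone
gcloud container get-server-config --zone europe-west1-b
# upgrade only the masters to the chosen 1.7.12 patch release
gcloud container clusters upgrade my-cluster --zone europe-west1-b --master --cluster-version 1.7.12-gke.1  # example patch release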
After the upgrade, we deleted all our pods called fluentd-gcp-* and the flood stopped instantly:
for pod in $(kubectl -n kube-system get pods | grep fluentd-gcp | awk '{print $1}'); do
  kubectl -n kube-system delete pod "$pod"
  sleep 20
done
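Deleting the pods is safe here because they are managed by the DaemonSet: the controller recreates each one immediately, and the sleep 20 just staggers the restarts so that log collection is never down on all nodes at the same time.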
Source: https://stackoverflow.com/questions/48442077/log-flood-after-master-upgrade-from-1-6-13-gke-0-to-1-7-11-gke-1