'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure Kubernetes server? (AKS)

温柔的废话  2020-12-03 08:09

My question (to MS and anyone else) is: Why is this issue occurring, and what workaround can be implemented by the users / customers themselves as opposed to …

4 Answers
  • 2020-12-03 08:29

    Workaround 1 (May Not Work for Everyone)

    An interesting solution (worked for me) to test is scaling the number of nodes in your cluster up, and then back down...

    1. Log into the Azure Console — Kubernetes Service blade.
    2. Scale your cluster up by 1 node.
    3. Wait for scale to complete and attempt to connect (you should be able to).
    4. Scale your cluster back down to the normal size to avoid cost increases.

    Alternatively, you can (maybe) do this from the command line:

    az aks scale --name <name-of-cluster> --node-count <new-number-of-nodes> --resource-group <name-of-cluster-resource-group>

    Since this is a finicky issue and I used the web interface, I am uncertain whether the above is identical or would work.
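
    If you want to try the CLI route end-to-end, a minimal sketch could look like the following. The resource group and cluster names are placeholders, and it assumes a single (first) agent pool; adjust if your cluster has more than one node pool.

    # Read the current node count of the first agent pool (assumes one default pool)
    CURRENT=$(az aks show --resource-group <name-of-cluster-resource-group> --name <name-of-cluster> --query "agentPoolProfiles[0].count" -o tsv)

    # Scale up by one node, test kubectl, then scale back down to avoid extra cost
    az aks scale --resource-group <name-of-cluster-resource-group> --name <name-of-cluster> --node-count $((CURRENT + 1))
    kubectl get nodes   # should connect once the scale operation completes
    az aks scale --resource-group <name-of-cluster-resource-group> --name <name-of-cluster> --node-count "$CURRENT"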

    Total time it took me ~2 minutes — for my situation that is MUCH better than re-creating / configuring a Cluster (potentially multiple times...)

    That being said...

    Zimmergren brings up some good points that scaling is not a true solution:

    "It worked sometimes, where the cluster self-healed a period after scaling. It failed sometimes with the same errors. I don't consider scaling a solution to this problem, as that causes other challenges depending on how things are set up. I wouldn't trust that routine for a GA workload, that's for sure. In the current preview, it's a bit wild west (and expected), and I'm happy to blow up the cluster and create a new one when this fails continuously." (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)

    Azure Support Feedback

    Since I had a support ticket open at the time I ran into the above scaling solution, I was able to get feedback (or rather a guess) on why the above might have worked; here's a paraphrased response:

    "I know that scaling the cluster can sometimes help if you get into a state where the number of nodes is mismatched between “az aks show” and “kubectl get nodes”. This may be similar."

    Workaround References:

    1. A GitHub user scaled nodes from the console and fixed the problem: https://github.com/Azure/AKS/issues/268#issuecomment-375722317

    Workaround Didn't Work?

    If this DOES NOT work for you, please post a comment below, as I am going to try to keep an up-to-date list of how often the issue crops up, whether it resolves itself, and whether this solution works across Azure AKS users (it looks like it doesn't work for everyone).

    Users Scaling Up / Down DID NOT work for:

    1. omgsarge (https://github.com/Azure/AKS/issues/112#issuecomment-395231681)
    2. Zimmergren (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)
    3. sercand — the scale operation itself failed — not sure if it would have impacted connectivity (https://github.com/Azure/AKS/issues/268#issuecomment-395301296)

    Scaling Up / Down DID work for:

    1. Me
    2. LohithChanda (https://github.com/Azure/AKS/issues/268#issuecomment-395207716)
    3. Zimmergren (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)

    Email Azure AKS Specific Support

    If, after all this diagnosis, you still suffer from this issue, please don't hesitate to send an email to aks-help@service.microsoft.com

  • 2020-12-03 08:30

    Adding another answer since this is now the official Azure Support solution when the above attempts do not work. I haven't experienced the issue in a while, so I can't verify this one, but it seems like it would make sense to me (based on previous experience).

    Credit on this one / full thread found here (https://github.com/Azure/AKS/issues/14#issuecomment-424828690)

    Check for Tunneling Issues

    1. SSH to the agent node which is running the tunnelfront pod.
    2. Get the tunnelfront logs: "docker ps" -> "docker logs <tunnelfront-container-id>"
    3. "nslookup <api-server-fqdn>", where the FQDN can be obtained from the command above -> if it resolves to an IP, DNS works, so go to the following step
    4. "ssh -vv azureuser@<api-server-fqdn> -p 9000" -> if the port is reachable, go to the next step
    5. "docker exec -it <tunnelfront-container-id> /bin/bash", then type "ping google.com"; if there is no response, the tunnelfront pod has no external network access, so do the following step
    6. Restart kube-proxy using "kubectl delete po <kube-proxy-pod-name> -n kube-system", choosing the kube-proxy pod that is running on the same node as tunnelfront. You can find it with "kubectl get po -n kube-system -o wide"

    I feel like this particular work-around could PROBABLY be automated (for sure on the Azure side, but probably also on the community side).
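
    As a starting point, here is a rough sketch of what automating step 6 might look like. It assumes the tunnelfront and kube-proxy pods are identifiable by name, and that the NODE column is the 7th field of "kubectl get po -o wide" output (true for current kubectl versions, but worth verifying before relying on it):

    # Find the node hosting tunnelfront, then delete the kube-proxy pod on that same node
    TUNNEL_NODE=$(kubectl get po -n kube-system -o wide --no-headers | grep tunnelfront | awk '{print $7}')
    KUBE_PROXY_POD=$(kubectl get po -n kube-system -o wide --no-headers | grep kube-proxy | awk -v n="$TUNNEL_NODE" '$7 == n {print $1}')
    kubectl delete po -n kube-system "$KUBE_PROXY_POD"   # it will be re-created automatically by its DaemonSet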

    Email Azure AKS Specific Support

    If, after all this diagnosis, you still suffer from this issue, please don't hesitate to send an email to aks-help@service.microsoft.com

  • 2020-12-03 08:45

    We just had this issue for one of our clusters. We sent a support ticket and got called back 5 minutes later by an engineer asking if it was OK for them to restart the API Server. Two minutes later it was working again.

    The reason was something about timeouts in their messaging queue.

  • 2020-12-03 08:50

    Workaround 2: Re-Create Cluster (Somewhat Obvious)

    I am adding this one because there are some details to keep in mind, and even though I touched on it in my original question, that post got long, so I am adding specific re-creation details here.

    Cluster Re-Creation Doesn't Always Work

    Per my original question above, there are multiple AKS server instances that divide up responsibilities for a given Azure region (we think). Some, or all, of these can be impacted by this bug, resulting in your Cluster being unreachable via Kubectl.

    That means that if you re-create your Cluster and it somehow lands on the same AKS server, that new Cluster will probably ALSO not be reachable, requiring...

    Additional Re-creation Attempts

    Re-creating multiple times will probably result in you eventually landing your new Cluster on one of the other AKS servers (which is working fine). As far as I can tell, I don't see any indication that ALL AKS servers get hit with this problem at once very often (if ever).

    Different Cluster Node Size

    If you are in a pinch and want the highest possible probability (we haven't confirmed this) that your re-creation lands on a different AKS management server — choose a different Node size when you create your new Cluster (see the Node Size section of the initial Question above).

    I have opened this ticket asking Azure DevOps whether or not the Node Size is ACTUALLY related to deciding which Clusters are administered by which AKS management servers: https://github.com/Azure/AKS/issues/416

    Support Ticket Fix vs. Self Healing

    Since a lot of users indicate that the problem occasionally solves itself and just goes away, I think it is reasonable to guess that Support actually fixes the offending AKS server (which may result in other users having their Clusters fixed — 'Self Heal') as opposed to fixing the individual user's Cluster.

    Creating Support Tickets

    To me, the above would likely mean that creating a Ticket is probably a good thing, since it would fix other users' Clusters experiencing the same issue — it might also be an argument for allowing support issue severity escalation for this specific issue.

    I think this is also a decent indicator that maybe Azure support hasn't figured out how to fully alarm for the problem yet, in which case creation of a support ticket serves that purpose as well.

    I also asked Azure DevOps whether they alarm on this issue on their side (in my experience the problem is easy to visualize from CPU and Network IO metric changes): https://github.com/Azure/AKS/issues/416

    If NOT (I haven't heard back), then it makes sense to create a ticket EVEN IF you plan to re-create your cluster, since that ticket would make Azure DevOps aware of the issue and result in a fix for other users on that Cluster management server.

    Things to make Cluster Re-Creation Easier

    I will add to this (feedback / ideas are appreciated) but off the top of my head:

    1. Be diligent (obvious) about how you store all the YAML files used to create your Cluster (even if you don't re-deploy often for your app by design); see the re-creation sketch after this list.
    2. Script your DNS modifications in order to speed up pointing to the new instance, if you have a public-facing app / service that utilizes DNS (maybe something like this example for Google Domains?: https://gist.github.com/cyrusboadway/5a7b715665f33c237996, full docs here: https://cloud.google.com/dns/api/v1/)
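
    As a rough sketch of scripting the re-creation step itself (one possible approach, with placeholder names, assuming you also want a different node VM size to improve the odds of landing on a different management server):

    # Delete the broken cluster and re-create it with a different node VM size
    az aks delete --resource-group <name-of-cluster-resource-group> --name <name-of-cluster> --yes
    az aks create --resource-group <name-of-cluster-resource-group> --name <name-of-cluster> --node-count 3 --node-vm-size Standard_DS3_v2 --generate-ssh-keys

    # Pull fresh credentials and re-apply your stored manifests once kubectl can connect again
    az aks get-credentials --resource-group <name-of-cluster-resource-group> --name <name-of-cluster>
    kubectl apply -f ./your-manifests/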