I have a Java app with automatic scaling on the App Engine standard environment. Right now the scaling is configured like this:
F2
It seems like the newly created instance (created when load increases) starts receiving all the incoming requests, while the resident instance sits with a very light load.
My mental model is that resident instances and warmup requests are only useful when the boot time of your GAE instance is long. (I'm not sure if that's the intent, but that's the behavior I've observed.)
Namely, traffic is sent to resident instances while the new instances are being booted (and other dynamic instances can't handle it). Once the new instance is up and running, traffic gets routed to it, instead of the resident instance.
Which means that if your instance boot time is low, then the resident instances won't be doing much work. An F2 can boot up in ~250ms (by my testing), so if your average response latency is 2000ms, then the new dynamic instance will have been booted completely before the resident instance finishes handling the request. As such, it'll be ready to handle subsequent requests instead of the resident one.
This appears to follow the behavior pattern you're seeing.
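To make the timing argument concrete, here is a toy calculation in Java. It is only a sketch of the mental model above, not App Engine's actual scheduler; the class and method names are made up for illustration, and the 250ms and 2000ms figures are the ones quoted in this answer.

```java
// Toy model of the timing argument; not App Engine's real scheduling logic.
// All names here are hypothetical.
class BootVsLatency {

    // True if a dynamic instance that starts booting when the resident instance
    // picks up a request is ready before that request finishes.
    static boolean dynamicReadyFirst(long bootMillis, long responseMillis) {
        return bootMillis < responseMillis;
    }

    // How long the new instance sits ready while the resident one is still busy.
    static long headStartMillis(long bootMillis, long responseMillis) {
        return Math.max(0L, responseMillis - bootMillis);
    }

    public static void main(String[] args) {
        long bootMillis = 250;      // F2 boot time observed in this answer
        long responseMillis = 2000; // average response latency from the question
        System.out.println("dynamic ready first: " + dynamicReadyFirst(bootMillis, responseMillis));
        System.out.println("head start (ms): " + headStartMillis(bootMillis, responseMillis));
    }
}
```

With these numbers the new instance is ready roughly 1750ms before the resident instance finishes its request, which would explain why it picks up the subsequent traffic.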
You might be able to confirm this by looking at how Stackdriver and the logs separate your response time from your boot time. If boot time is really small, then resident instances might not help you much.
but GAE's load balancer chooses to send all requests to the instance with the highest latency!
Sadly there's not much info about how GAE decides which instance to send new requests to. All I've found is "How instances are managed" and the scheduling settings documentation, which talk more about the parameters that control when new instances are booted.
I know it's not the question you asked, but the 2000ms response time might be contributing to the issue here. If your min-pending-latency is set to 2000, then new requests will sit in the queue for 2000ms before a new instance is spawned. But if requests are being serviced serially (threadsafe off), then response times that land between 1500 and 2000 would still be serviced properly.
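A rough way to picture that interaction is sketched below. It assumes min-pending-latency acts as a simple wait threshold before a new instance is started, which is my reading of the setting and not a confirmed description of the scheduler. The class and method names are hypothetical.

```java
// Simplified sketch of the pending-queue decision; an assumption, not GAE's actual algorithm.
class PendingQueueModel {

    // A queued request triggers a new instance only if the busy instance will not
    // free up before the request has waited minPendingLatencyMillis.
    static boolean wouldSpawnInstance(long millisUntilInstanceFree, long minPendingLatencyMillis) {
        return millisUntilInstanceFree > minPendingLatencyMillis;
    }

    public static void main(String[] args) {
        long minPendingLatencyMillis = 2000;
        // Serial handling: the in-flight request finishes in 1500ms, so the queued one is picked up first.
        System.out.println(wouldSpawnInstance(1500, minPendingLatencyMillis)); // prints false
        // The in-flight request needs 2500ms, so the queued one waits out the threshold.
        System.out.println(wouldSpawnInstance(2500, minPendingLatencyMillis)); // prints true
    }
}
```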
I would suggest turning on threadsafe to see if that helps, and also adding some custom tracing in case the code is doing something odd that you don't have visibility into.
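For the custom tracing, something as small as the helper below may be enough to start with. This is a minimal sketch using only the JDK (System.nanoTime and java.util.logging); the class name, method name, and labels are hypothetical, and you would wrap whichever steps of your handler you suspect.

```java
import java.util.function.Supplier;
import java.util.logging.Logger;

// Minimal timing helper for ad hoc tracing; all names are hypothetical.
class RequestTracing {
    private static final Logger LOG = Logger.getLogger(RequestTracing.class.getName());

    // Runs the given step, logs how long it took, and returns its result.
    static <T> T timed(String label, Supplier<T> step) {
        long start = System.nanoTime();
        try {
            return step.get();
        } finally {
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            LOG.info(label + " took " + elapsedMillis + "ms");
        }
    }

    public static void main(String[] args) {
        // Stand-in for a slow call inside a request handler.
        int result = timed("sum-loop", () -> {
            int total = 0;
            for (int i = 0; i < 1_000; i++) {
                total += i;
            }
            return total;
        });
        System.out.println(result);
    }
}
```

Comparing the logged step durations against the overall request latency should show whether the 2000ms is spent in your own code or somewhere you can't see.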