I have an ES cluster with 4 nodes:
number_of_replicas: 1
search01 - master: false, data: false
search02 - master: true, data: true
search03 - master: false, data: true
search04 - master: false, data: true
I had two indices with unassigned shards that didn't seem to be self-healing. I eventually resolved this by temporarily adding an extra data-node[1]. After the indices became healthy and everything stabilized to green, I removed the extra node and the system was able to rebalance (again) and settle on a healthy state.
It's a good idea to avoid killing multiple data nodes at once (which is how I got into this state). Likely, I had failed to preserve any copies/replicas for at least one of the shards. Luckily, Kubernetes kept the disk storage around, and reused it when I relaunched the data-node.
...Some time has passed...
Well, this time just adding a node didn't seem to be working (after waiting several minutes for something to happen), so I started poking around in the REST API.
GET /_cluster/allocation/explain
This showed my new node with "decision": "YES". By the way, all of the pre-existing nodes had "decision": "NO" due to "the node is above the low watermark cluster setting". So this was probably a different case than the one I had addressed previously.
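If you want the explanation for one particular shard rather than whichever unassigned shard the cluster picks for you, the same endpoint accepts a small body (the index name below is just a placeholder):

GET /_cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}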
Then I made the following simple POST[2] with no body, which kicked things into gear...
POST /_cluster/reroute
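I didn't need it in my case, but if shards are stuck because they exceeded the allocation retry limit, the reroute call also accepts a retry_failed flag:

POST /_cluster/reroute?retry_failed=true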
Other notes:
Very helpful: https://datadoghq.com/blog/elasticsearch-unassigned-shards
Something else that may work: set cluster_concurrent_rebalance to 0, then to null -- as I demonstrate here.
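For reference, that toggle looks roughly like this in the Dev Tools console (I'm using transient settings here; persistent would also work):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 0
  }
}

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null
  }
}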
[1] Pretty easy to do in Kubernetes if you have enough headroom: just scale out the stateful set via the dashboard.
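From the command line, the equivalent is something like the following (the StatefulSet name and replica count are made-up examples; adjust for your deployment):

kubectl scale statefulset elasticsearch-data --replicas=4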
[2] Using the Kibana "Dev Tools" interface, I didn't have to bother with SSH/exec shells.