I’m currently doing some destructive stress testing in the lab before I push a design to a client.
I’ve bootstrapped 3 HA controllers in an Openstack cloud and I’m testing failures at the cloud layer.
In one test we killed the power to 2 compute nodes which in turn tears down 2 of the 3 Juju controllers.
Now according to the Juju docs (and general laws of computer clustering) the remaining controller should become master and there should be little to no service interruption, however what we’re seeing is that the remaining controllers API completely dies and in /var/log/juju/machine-n.log we see:
2020-03-10 05:19:11 ERROR juju.worker.dependency engine.go:671 "state" manifold worker returned unexpected error: no reachable servers 2020-03-10 05:19:54 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [5485ee] "machine-2" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
What are the rules around a valid and available ‘cluster’ in the Juju context?