Need help understanding HA controllers

I’m currently doing some destructive stress testing in the lab before I push a design to a client.
I’ve bootstrapped 3 HA controllers in an OpenStack cloud and I’m testing failures at the cloud layer.

In one test we killed the power to 2 compute nodes, which in turn took down 2 of the 3 Juju controllers.
According to the Juju docs (and the general laws of computer clustering), the remaining controller should become master with little to no service interruption. What we’re seeing instead is that the remaining controller’s API dies completely, and in /var/log/juju/machine-n.log we see:

2020-03-10 05:19:11 ERROR juju.worker.dependency engine.go:671 "state" manifold worker returned unexpected error: no reachable servers
2020-03-10 05:19:54 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [5485ee] "machine-2" cannot open api: unable to connect to API: dial tcp connect: connection refused

What are the rules around a valid and available ‘cluster’ in the Juju context?

I don’t know how Juju’s clustering works, but in my experience most modern clustering technology, which usually uses the Raft algorithm for consensus, requires a majority of the cluster members to be online to maintain availability. This means that with 3 servers in the cluster, you can only afford to lose one server.

You might try taking down just one controller and see if the remaining two stay available.

Indeed, general HA theory says that you can’t tell the difference between a net split, where machines are actively running but unable to talk to each other (though possibly still able to talk to clients), and machines actually being down. So imagine someone accidentally rewrote the firewall rules so that the controllers were blocked from each other. If we allowed a single node to vote itself master, then all 3 would vote themselves master and let you do whatever you wanted. Once that firewall rule was removed, you would have no way to reconcile the differences between them.
So the general rule is that you always have to have a majority of nodes that can talk to each other, because there can’t be 2 non-overlapping sets that both have a majority. So with HA=3 you can lose 1 machine, with HA=5 you can lose 2 machines, and still have a valid master.
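
To make the arithmetic concrete, here is a minimal Python sketch of the majority rule described above. It is only an illustration, not Juju's actual code, and the function names are made up for this example.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority still remains."""
    return voters - quorum_size(voters)


def has_quorum(voters: int, reachable: int) -> bool:
    """True if the reachable voters form a strict majority of all voters."""
    return reachable >= quorum_size(voters)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"HA={n}: quorum={quorum_size(n)}, can lose {tolerated_failures(n)}")
    # The scenario from the original post: 3 controllers, 2 powered off.
    print("3 controllers, 1 reachable, quorum:", has_quorum(3, 1))
```

Because two disjoint groups can never both reach quorum_size(n) members, at most one side of a net split can ever elect a master.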

If you do run into a situation where you have 2 machines that are dead, we have manual ways of recovering (generally you have to directly edit the database to remove the peers).

If, on the other hand, you’re doing a controlled teardown, as of Juju 2.6 or so we do support running juju remove-machine 1, waiting for that to settle (HA actually goes down to 1 with a ‘hot spare’ that can’t actually become master), then running juju remove-machine 0, after which the master ends up on the final node.

The reason we go to HA=1 if you have 2 nodes is because neither node can vote itself master anyway, so if you were in HA=2, then if either node went down, you would lose master entirely. With HA=1, if the primary goes down, the secondary won’t nominate itself (though you can do it manually), but if the secondary went down, the primary would continue to function.
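
The reasoning above can be checked by enumerating single-machine failures. This is a simplified model with hypothetical names, assuming a master needs a strict majority of the configured voters; it does not reflect how Juju actually manages its peer group.

```python
def has_master_after_loss(voters: set, lost: str) -> bool:
    """True if the voters that survive losing `lost` are a strict majority."""
    surviving = voters - {lost}
    return len(surviving) > len(voters) / 2


def survivable_losses(members: list, voters: set) -> list:
    """Machines whose individual loss still leaves a working master."""
    return [m for m in members if has_master_after_loss(voters, m)]


if __name__ == "__main__":
    members = ["0", "1"]
    # HA=2: both machines vote, so losing either one loses the majority.
    print("two voters:", survivable_losses(members, {"0", "1"}))
    # HA=1 plus hot spare: only machine 0 votes, machine 1 is a non-voter.
    print("one voter + spare:", survivable_losses(members, {"0"}))
```

With two voters no single loss is survivable, while with one voter and a non-voting spare, losing the spare leaves the primary working.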

In our experience, HA=5 isn’t worth the overhead (every write becomes slower because it has to be replicated to more nodes before it is approved). So our recommended configuration is always HA=3. It lets you lose a machine, without causing disruption, which also gives you some time to replace it.

@zicklag @jameinel Thanks for the great explanations.
I guess it all comes down to the fact that there’s no shared quorum device between the cluster members. The numbers then do make sense.
I did verify that bringing a second controller back online brings the cluster back up.
A cluster of 5 is seriously high overhead and I doubt it’s something I’d ever do, but the slower replication is good to know about :metal:
