These are some notes on leadership in Juju, from the perspective of someone operating Juju deployed applications.
Charm writers can find more information on Leadership in the following pages:
Determining leadership
juju status
In the default (tabular) format, leadership is indicated by a * after the unit name, e.g.:
```
Unit                     Workload  Agent  Machine   Public address  Ports     Message
keystone/0*              active    idle   21/lxd/3  10.191.0.104    5000/tcp  Unit is ready
  hacluster-keystone/2   active    idle             10.191.0.104              Unit is ready and clustered
keystone/1               active    idle   22/lxd/3  10.191.0.115    5000/tcp  Unit is ready
  hacluster-keystone/0   active    idle             10.191.0.115              Unit is ready and clustered
keystone/2               active    idle   23/lxd/3  10.191.0.114    5000/tcp  Unit is ready
  hacluster-keystone/1*  active    idle             10.191.0.114              Unit is ready and clustered
```
Or in YAML format, it is indicated by the ‘leader’ key with the value ‘true’:
```yaml
units:
  keystone/0:
    workload-status:
      current: active
      message: Unit is ready
      since: 01 Aug 2020 04:06:10Z
    juju-status:
      current: idle
      since: 02 Aug 2020 22:25:09Z
      version: 2.5.8
    leader: true
    machine: 21/lxd/3
    open-ports:
    - 5000/tcp
    public-address: 10.191.0.104
```
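If you want to pull the leader units out of the YAML output programmatically, a little awk is enough. This is a sketch that relies on the indentation shown above; on a live system you would pipe `juju status --format=yaml` in instead of the inline sample used here:

```shell
# Sketch: list units marked "leader: true" in `juju status --format=yaml` output.
# The inline sample below is illustrative; replace the heredoc with a pipe from
# `juju status --format=yaml` on a real controller.
leaders=$(awk '
  /^[[:space:]]+[a-z0-9-]+\/[0-9]+:$/ { unit = $1; sub(/:$/, "", unit) }
  /^[[:space:]]+leader: true$/        { print unit }
' <<'EOF'
units:
  keystone/0:
    leader: true
  keystone/1:
    machine: 22/lxd/3
EOF
)
echo "$leaders"  # → keystone/0
```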
juju run
juju run provides two ways to access leader information.
The first is via APPLICATION/leader as a unit name argument, e.g.:
```
juju run -u keystone/leader hostname
```
The second is through the is-leader tool:
```
juju run -a keystone is-leader
- Stdout: |
    True
  UnitId: keystone/0
- Stdout: |
    False
  UnitId: keystone/1
- Stdout: |
    False
  UnitId: keystone/2
```
Implementation
Leadership is implemented in Juju via Raft, run on the controller agents. It uses the HashiCorp implementation of Raft, with BoltDB as the LogStore and SnapStore.
Details
Everything related to Raft is stored on the controllers in /var/lib/juju/raft
```
tree /var/lib/juju/raft
/var/lib/juju/raft
├── logs
└── snapshots
    ├── 162-123443686-1596405425726
    │   ├── meta.json
    │   └── state.bin
    └── 162-123452624-1596407728185
        ├── meta.json
        └── state.bin

3 directories, 5 files
```
The Raft consensus log is stored in /var/lib/juju/raft/logs. There are two snapshots in /var/lib/juju/raft/snapshots. In normal operation, these two snapshots are approximately 30-35 minutes apart in age (mtime), and the newest should be no more than 30-35 minutes old.
On stable deployments (i.e. without much leader churn), it’s possible that the UNIX timestamp in the snapshots directory name can be many months old; that’s normal.
MongoDB
Although the Raft cluster is authoritative for determining who the leader is, the leadership data is also stored in MongoDB in the leaseholders collection:
```
juju:PRIMARY> db.leaseholders.find( { _id: /#keystone#/ } )
{ "_id" : "4ee83d92-d15a-4f0b-8c1a-0ed4157df8cc:application-leadership#keystone#", "namespace" : "application-leadership", "model-uuid" : "4ee83d92-d15a-4f0b-8c1a-0ed4157df8cc", "lease" : "keystone", "holder" : "keystone/2", "txn-revno" : NumberLong(2), "txn-queue" : [ ] }
juju:PRIMARY>
```
Pinning
Juju has the ability to ‘pin’ the leader of an application. This is used by Juju itself in the context of series upgrades. When Juju is asked to ‘prepare’ a machine for a series upgrade, Juju will pin the leadership of all leader units on that machine. When you tell Juju the series upgrade is ‘complete’, it will unpin any leader units on that machine.
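As a sketch of that flow (the machine ID and target series here are illustrative, and this assumes a Juju release that provides the upgrade-series command):

```
juju upgrade-series 21 prepare focal   # leaders on machine 21 are pinned
# ... upgrade the operating system on the machine ...
juju upgrade-series 21 complete        # leaders on machine 21 are unpinned
```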
Today, pinning is only visible in the Raft cluster, and no sanity checking of the pins is done (e.g. whether or not the unit still exists). Combined with several bugs in Juju before 2.8.1, pinning makes it possible to end up with ‘ghost leader’ units, which manifest as multi-unit applications with no visible leader.
Inspecting leadership
Unfortunately, today Juju provides no way to introspect the state of Leadership or the health of the Raft cluster.
A bug has been filed asking for a ‘show-leader’ command (or something similar) to provide such functionality.
In the meantime, the easiest way to determine the state of leadership in Juju is to look at the most recent ‘state.bin’ file - despite its name, it is plain text.
There are third-party tools that can be used to examine the Raft consensus log directly; however, they require that the machine agent on the controller be stopped before they can run. When in dire straits, you can also use the strings tool (from the binutils package) on the live consensus log.
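For example (run on a controller machine; the application name grepped for is illustrative, and snapshot directory names will differ on your deployment):

```
# state.bin is plain text despite its name, so grep works directly on it:
grep keystone /var/lib/juju/raft/snapshots/*/state.bin
# In dire straits, pull printable strings out of the live consensus log:
strings /var/lib/juju/raft/logs | grep keystone
```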
Logging
Logging relevant to leadership can be found in several places:
- On controllers: /var/log/juju/lease.log
- On controllers: /var/log/juju/machine-*.log (suggested search term: ‘raft’)
- On units: /var/log/juju/unit-*.log (suggested search terms: ‘leader’ or ‘lease’)
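Putting the suggestions above together (run on the relevant machines; a sketch, not an exhaustive set of search terms):

```
# On a controller:
tail -f /var/log/juju/lease.log
grep -i raft /var/log/juju/machine-*.log
# On a machine hosting units:
grep -iE 'leader|lease' /var/log/juju/unit-*.log
```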
Recovery
In normal operation, you should never need to recover or otherwise mess with the Leadership Raft cluster. If the Raft consensus log is deleted from one or more nodes, Juju will simply restore from the last snapshot.
Due to a bug, Juju is unable to recover the Raft cluster if the controllers are running in HA mode.
Forcing a leadership change
In general Juju Leadership is designed to be an automated system that does not need operator intervention.
However, in the event of Juju bugs it’s possible (for example) to end up with the leadership owned by a ghost unit.
At the time of writing, no CLI or API exists to force a leadership change, but a bug has been filed asking for this (the alternative is to hack the snapshots and force a restore from them, which is obviously not ideal).
Controller config
The undocumented non-synced-writes-to-raft-log controller config option will, if set, disable fsync calls after raft log writes. This can help in situations where the controller does not have sufficient IO capacity to keep up, but should not normally be needed.
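If needed, it can be set like any other controller config key. Note that skipping fsync trades durability for throughput: recently written log entries can be lost on a power failure or crash.

```
# Use with caution: unsynced raft log writes can be lost on power failure.
juju controller-config non-synced-writes-to-raft-log=true
```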