Leadership in Juju - Operations Perspective

These are some notes on leadership in Juju, from the perspective of someone operating Juju-deployed applications.

Charm writers can find more information on leadership in the Juju charm documentation.

Determining leadership

juju status

In the default (tabular) format, leadership is indicated by a * after the unit name, e.g.:

Unit                          Workload  Agent  Machine   Public address  Ports          Message
keystone/0*                   active    idle   21/lxd/3  10.191.0.104    5000/tcp       Unit is ready
  hacluster-keystone/2        active    idle             10.191.0.104                   Unit is ready and clustered
keystone/1                    active    idle   22/lxd/3  10.191.0.115    5000/tcp       Unit is ready
  hacluster-keystone/0        active    idle             10.191.0.115                   Unit is ready and clustered
keystone/2                    active    idle   23/lxd/3  10.191.0.114    5000/tcp       Unit is ready
  hacluster-keystone/1*       active    idle             10.191.0.114                   Unit is ready and clustered

In YAML format (juju status --format yaml), it is indicated by the ‘leader’ key with the value ‘true’:

    units:
      keystone/0:
        workload-status:
          current: active
          message: Unit is ready
          since: 01 Aug 2020 04:06:10Z
        juju-status:
          current: idle
          since: 02 Aug 2020 22:25:09Z
          version: 2.5.8
        leader: true
        machine: 21/lxd/3
        open-ports:
        - 5000/tcp
        public-address: 10.191.0.104

juju run

juju run provides two ways to access leader information.

The first is via APPLICATION/leader as a unit name argument, e.g.:

juju run -u keystone/leader hostname

The second is through the is-leader hook tool:

juju run -a keystone is-leader
- Stdout: |
    True
  UnitId: keystone/0
- Stdout: |
    False
  UnitId: keystone/1
- Stdout: |
    False
  UnitId: keystone/2
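
Because juju run executes in a hook context on the target unit, hook environment variables such as JUJU_UNIT_NAME are also available, which gives a quick way to resolve the leader alias to a concrete unit name (keystone is just an example; note the single quotes, so the variable expands on the unit rather than locally):

juju run -u keystone/leader 'echo $JUJU_UNIT_NAME'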

Implementation

Leadership is implemented in Juju via Raft, run on the controller agents. It uses the Hashicorp implementation of Raft, with BoltDB as the LogStore and snapshots written out as plain files on disk.

Details

Everything related to Raft is stored on the controllers in /var/lib/juju/raft:

tree /var/lib/juju/raft
/var/lib/juju/raft
├── logs
└── snapshots
    ├── 162-123443686-1596405425726
    │   ├── meta.json
    │   └── state.bin
    └── 162-123452624-1596407728185
        ├── meta.json
        └── state.bin

3 directories, 5 files

The Raft consensus log is stored in /var/lib/juju/raft/logs.

There are two snapshots in /var/lib/juju/raft/snapshots. In normal operation, these two snapshots are approximately 30-35 minutes apart in age (mtime), and the newest should be no more than 30-35 minutes old.

On stable deployments (i.e. without much leader churn), the UNIX timestamp in the snapshot directory names can be many months old; that’s normal.
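
A quick freshness check is to list the snapshot directories by modification time on a controller (as root):

ls -lt /var/lib/juju/raft/snapshots/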

MongoDB

Although the Raft cluster is authoritative for determining who the leader is, the leadership data is also stored in MongoDB in the leaseholders collection:

juju:PRIMARY> db.leaseholders.find( { _id: /#keystone#/ } )
{ "_id" : "4ee83d92-d15a-4f0b-8c1a-0ed4157df8cc:application-leadership#keystone#", "namespace" : "application-leadership", "model-uuid" : "4ee83d92-d15a-4f0b-8c1a-0ed4157df8cc", "lease" : "keystone", "holder" : "k
eystone/2", "txn-revno" : NumberLong(2), "txn-queue" : [ ] }
juju:PRIMARY> 

Pinning

Juju has the ability to ‘pin’ the leader of an application. This is used by Juju itself in the context of series upgrades. When Juju is asked to ‘prepare’ a machine for a series upgrade, it will pin the leadership of all leader units on that machine. When you tell Juju the series upgrade is ‘complete’, it will unpin any leader units on that machine.
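
For reference, those prepare/complete steps are the ones an operator drives with the upgrade-series command; the machine number and target series here are illustrative:

juju upgrade-series 21 prepare focal
juju upgrade-series 21 complete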

Today, pinning is only visible in the Raft cluster, and no sanity checking of the pins (e.g. whether or not the unit still exists) is done. Pinning, combined with several bugs in Juju before 2.8.1, makes it possible to end up with ‘ghost leader’ units, which manifest as multi-unit applications with no visible leader.

Inspecting leadership

Unfortunately, Juju currently provides no way to introspect the state of leadership or the health of the Raft cluster.

A bug has been filed asking for a ‘show-leader’ command (or something similar) to provide such functionality.

In the meantime, the easiest way to determine the state of leadership in Juju is to look at the most recent ‘state.bin’ snapshot file; despite its name, it is plain text.
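
For example, on a controller (as root), the following lists the snapshot directories newest-first and opens the latest state.bin; the exact directory names will differ per deployment:

newest=$(ls -td /var/lib/juju/raft/snapshots/*/ | head -1)
less "${newest}state.bin"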

There are third-party tools that can be used to examine the Raft consensus log directly; however, they require the machine agent on the controller to be stopped while they run.

When in dire straits, you can also use the strings tool (from the binutils package) on the live consensus log.
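
For example, to fish leadership entries for a given application out of the live log (keystone is illustrative; run as root on a controller):

strings /var/lib/juju/raft/logs | grep keystone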

Logging

Logging relevant to leadership can be found in several places; some example search commands follow the list:

  • On controllers: /var/log/juju/lease.log
  • On controllers: /var/log/juju/machine-*.log
    • Suggested search term: ‘raft’
  • On units: /var/log/juju/unit-*.log
    • Suggested search terms: ‘leader’ or ‘lease’
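
A sketch of how to search these (the unit log name is just an example):

# on a controller:
grep -i raft /var/log/juju/machine-*.log

# on a unit's machine:
grep -Ei 'leader|lease' /var/log/juju/unit-keystone-0.log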

Recovery

In normal operation, you should never need to recover or otherwise mess with the Leadership Raft cluster. If the Raft consensus log is deleted from one or more nodes, Juju will simply restore from the last snapshot.

Due to a bug, Juju is unable to recover the Raft cluster if the controllers are running in HA mode.
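
If you ever do need to force a restore from snapshot on a single (non-HA) controller, the procedure sketched below relies on the behaviour described above: stop the machine agent, move the consensus log aside, and restart. The agent name jujud-machine-0 and the backup path are illustrative defaults:

# on the controller machine, as root:
systemctl stop jujud-machine-0
mv /var/lib/juju/raft/logs /var/lib/juju/raft-logs.bak
systemctl start jujud-machine-0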

Forcing a leadership change

In general Juju Leadership is designed to be an automated system that does not need operator intervention.

However, in the event of Juju bugs, it’s possible (for example) to end up with leadership owned by a ghost unit.

At the time of writing, no CLI or API exists to force a leadership change, but a bug has been filed asking for this (the alternative is to hack the snapshots and force a restore from them, which is obviously not ideal).

Controller config

The undocumented non-synced-writes-to-raft-log controller config option will, if set, disable fsync calls after raft log writes. This can help in situations where the controller does not have sufficient IO capacity to keep up, but should not normally be needed.
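
Setting it is an ordinary controller config change, e.g.:

juju controller-config non-synced-writes-to-raft-log=true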
