Unit relation data doesn't get removed from the relation when a unit departs the relation

I have found myself stuck here previously (filed bug), when using peer relations. We seem to be in a stalemate for the moment when it comes to figuring out how to remove a unit’s relation data from the relation.

What is the actual problem?

The relation data remains unchanged after a unit departs the relation.

What have I tried?

I have tried a few things to remove the departing unit’s relation data.

The first thing I had the inclination to do was to have the unit remove its own data on relation-departed

def _on_departed(self, event):
    del event.relation.data[self.model.unit]

which results in the following exception

TypeError: 'RelationData' object does not support item deletion

Secondly, from the departing unit, I tried to remove each item inside of the unit’s unit data.

import copy
def _on_departed(self, event):
    unit_data = copy.deepcopy(event.relation.data[self.model.unit])
    for k, _ in unit_data:
        del event.relation.data[self.model.unit][k]

The above code executes successfully and removes any data from the unit data, but, the unit still exists on the relation, even long after it goes away. After the code^ executes, the relation data for the unit is empty, but the unit still exists in the relation data, it’ just doesn’t carry anything anymore.

Trying to get relation data for a unit that has never existed on the relation, is a case we can expect to fail.

$ juju run --unit slurmd/10  \
    "relation-get --relation=22 ingress-address slurmd/11"
ERROR cannot read settings for unit "slurmd/11" in relation "slurmctld:slurmd slurmd:slurmd": unit "slurmd/11": settings not found

I would expect the same result if the unit data was queried for a unit that had previously existed, but was then removed.

Instead, the unit data remains on the relation after the unit is gone.

# The ingress address doesn't exist because we removed 
# it in the departed hook, but the key for the unit is still 
# in the unit data.
$ juju run --unit slurmd/10  \
    "relation-get --relation=22 ingress-address slurmd/9"

The user or charmer facing issue here is that we want to depend on the relation-changed event for connected units when a unit departs the other side of the relation. It makes sense to me, a unit is leaving the relation, its data should be removed, related units should have a relation-changed hook event where the relation data doesn’t contain the key of the unit that left the relation.

Does this make sense to anyone else?

What to do?

1 Like

I put this this pastebin together. I feel this exemplifies the problem at hand.

1 Like

You raise some good points here. From the perspective of Juju, setting the relation key to an empty string removes it. It’s probably an oversight not to support this functionality within the Operator Framework. @chipaca and the rest of the team will have more context there.

If setting it to the empty string actually removes it, I don’t see a reason for ops not to support deletion via doing just that. @jameinel could you confirm?

So we do support removing a key from relation data using del, and I think we support deleting it by setting it to the empty string. That isn’t the issue that @jamesbeedy is having, though.

I’m a little confused from the pastebin, because in that pastebin, both slurmd/1 and slurmd/2 are considered active, so I would expect slurmd/1 to see slurmd/2’s data.

From the original post, I thought the issue was just that the Operator Framework doesn’t properly remove related units from a relation as a result of something like ‘relation-departed’.

Isn’t this relation-departed (a hook should occur with the removed unit no longer existing).
It is entirely reasonable to say that we might have a bug in the Operator framework, where we might be caching the relation data for a given unit, and iterating over relation.units and/or relation.data[dead_unit] would be returning data that isn’t valid.

I’d be a bit surprised to have relation.units returning invalid information. AIUI it just iterates over Juju’s relation-list to figure out what units are part of the relation. And according to the rest of the discussion, that seems to be giving the correct values.

I don’t think we intend to support del relation.data[unit] because that would be removing the unit from the relation entirely, but the unit still exists, and the relation still exists. Which is very different from del relation.data[unit][key] which just deletes a single key (which I thought we already supported).
We could have del relation.data[unit] indicate that it should set everything to empty, but I think you can already do that with relation.data[unit] = {}. And I find the latter much more obvious as to what would be going on behind the scenes (if we don’t support that yet, I would be happy to have it added).

Just to confirm, I poked around my local test juju controller.


$ juju status --relations
Model    Controller  Cloud/Region  Version  SLA          Timestamp
default  lxd         lxd/default   2.8.1    unsupported  16:51:40+04:00

App  Version  Status  Scale  Charm            Store  Rev  OS      Notes
uo   18.04    active      2  ubuntu-operator  local    1  ubuntu
uoo  18.04    active      2  ubuntu-operator  local    2  ubuntu

Unit    Workload  Agent  Machine  Public address  Ports  Message
uo/2*   active    idle   3        10.5.24.63             load: 0.08 0.15 0.18
uo/3    active    idle   4        10.5.24.226            load: 0.06 0.14 0.18
uoo/0*  active    idle   3        10.5.24.63             load: 0.07 0.14 0.18
uoo/1   active    idle   4        10.5.24.226            load: 0.19 0.20 0.19

Machine  State    DNS          Inst id        Series  AZ  Message
3        started  10.5.24.63   juju-b676a4-3  bionic      Running
4        started  10.5.24.226  juju-b676a4-4  bionic      Running

Relation provider  Requirer     Interface    Type     Message
uo:peer            uo:peer      ubuntu-peer  peer
uo:ubuntu          uoo:other-u  ubuntu       regular
uoo:peer           uoo:peer     ubuntu-peer  peer

Now we just have uo/2 and uo/3.

$ juju run --unit uo/2 'relation-ids peer'
peer:0
$ juju run --unit uo/2 'relation-list -r peer:0'
uo/3
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/2'
egress-subnets: 10.5.24.63/32
ingress-address: 10.5.24.63
private-address: 10.5.24.63
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/3'
egress-subnets: 10.5.24.226/32
ingress-address: 10.5.24.226
private-address: 10.5.24.226
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/1'
ERROR cannot read settings for unit "uo/1" in relation "uo:peer": unit "uo/1": settings not found

So the unit can see its own data in the peer relation and can see the related unit, but not the unit which has been removed. Just to confirm:

$ juju add-unit uo --to 3
$ juju run --unit uo/2 'relation-list -r peer:0'
uo/3
uo/4
$ juju run --unit uo/4 'relation-set -r peer:0 foo=bar'
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/4'
egress-subnets: 10.5.24.63/32
foo: bar
ingress-address: 10.5.24.63
private-address: 10.5.24.63
$ juju remove-unit uo/4
removing unit uo/4
$ juju run --unit uo/2 'relation-list -r peer:0'
uo/3
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/4'
egress-subnets: 10.5.24.63/32
foo: bar
ingress-address: 10.5.24.63
private-address: 10.5.24.63

So this says there is a Juju bug, where deleting a unit does remove it from the relation (relation-list doesn’t show it anymore), but relation-get is still able to get access to that unit’s old relation data.

That might be a Caching coherency issue in the unit agent. Let me try restarting it and see if it can still see the data:

$ juju run --unit uo/2 'systemctl restart jujud-unit-uo-2'
action terminated
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/4'
egress-subnets: 10.5.24.63/32
foo: bar
ingress-address: 10.5.24.63
private-address: 10.5.24.63

And if we check in the Database:

juju:PRIMARY> db.settings.find({"_id": {"$regex": /.*r#0#.*/}}).pretty()
...
{
        "_id" : "1a914b3f-a6ed-4bd4-8992-733923b676a4:r#0#peer#uo/4",
        "model-uuid" : "1a914b3f-a6ed-4bd4-8992-733923b676a4",
        "settings" : {
                "egress-subnets" : "10.5.24.63/32",
                "private-address" : "10.5.24.63",
                "ingress-address" : "10.5.24.63",
                "foo" : "bar"
        },
        "version" : NumberLong(1),
        "txn-revno" : NumberLong(3),
        "txn-queue" : [
                "5f3143c0d07ccc0830cead0d_8a28b79e"
        ]
}
...

There is the relation data for the uo/4 peer in relation ‘0’.

So Juju isn’t deleting the relation data for units when they are removed from a relation.

Note that this isn’t specific to peer relations, I tried it with 2 applications over a normal provides/subscribes relation and even though ‘relation-list’ says the unit isn’t part of the relation, relation-get will return its old data:

$ juju run --unit uo/2 'relation-list -r 2'
uoo/0
uoo/1
$ juju run --unit uo/2 'relation-get -r 2 - uoo/2'
blah: blah
egress-subnets: 10.5.24.63/32
ingress-address: 10.5.24.63
private-address: 10.5.24.63

Anyway, in the Operator Framework, relation.units should have the correct set of units that are currently part of the relation (though potentially during ‘relation-departed’ while we tell you that unit/4 is gone, we might still report it as present in relation-list).

Relation-changed is always a change to one-particular-data-bag, so it doesn’t quite make sense to trigger relation-changed in response to a unit going away, but it does seem like maybe we should ensure that you can’t read relation data for a unit that no longer exists.

1 Like

In line 29 of the pastebin slurmd/2 is removed.

@jamesbeedy I’d like to concretely understand what the problem is. Namely, yes you can still read the old unit’s relation data if you know that unit existed.
However, given that ‘relation-list’ doesn’t tell you that unit exists, what is the advantage of having ‘relation-get’ just fail. How does your charm “know about” the unit that is no longer present? Is it just going through all possible unit ids and seeing if they have data?
Is there something else wrong where ‘relation-list’ is returning the wrong data? (Maybe during relation-departed we haven’t yet treated the unit as gone?)

I don’t see anything in: https://github.com/jamesbeedy/charm-peerer/blob/master/src/charm.py that cares about other unit’s hostnames. So I don’t see where it is getting the information that you think should no longer be available.

Note that I definitely think that we should be removing the relation data for a unit that is removed from the model. We probably didn’t catch this because if you iterate ‘relation-list’ you don’t see that the unit exist for you to find out that you had access to data that should have been removed. So I’d like to understand where it is a problem.

Gotcha, and I did confirm that behavior. But if ‘relation-list’ correctly omits it, I’m not sure “it matters” other than it is data that should be removed but isn’t, but it is only because you are recording a unit name outside of what Juju is telling you that you know to ask about its information.

I am trying to iterate over the units in the relations for the slurmd endpoint here. This is where the problem exposes itself in my charm code, I can’t get the updated list of active units. No matter how many units I remove, relation-changed is never fired, and the relation data never changes.

relation-changed isn’t fired for removed relations. relation-departed is, that is what -departed is for. In https://github.com/omnivector-solutions/slurm-charms/blob/238f9eb53aacd7778824c898ea551dc826e31df1/charm-slurmctld/src/slurmd_requires.py#L54 you look at relation-joined, relation-broken and relation-changed but never relation-departed.
If you want to know when to remove a unit from your data, that would be relation-departed. There may be an issue that ‘relation.units’ doesn’t notice that the unit is ‘going away’ during -departed.

@jameinel The relation data isn’t updated in departed or unless a unit is added, so departed doesn’t get me anywhere. I was trying to rely on the data changing to signal a config rewrite etc etc.

But the set of units does change/ the relation departed event itself tells you that unit foo/3 is going away.
Does ‘relation.units’ include the departed unit?
In my debug-hooks test, I see:

# relation-list -r 0
uo/3

And the related unit has been removed.
Which means that your line:
https://github.com/omnivector-solutions/slurm-charms/blob/238f9eb53aacd7778824c898ea551dc826e31df1/charm-slurmctld/src/slurmd_requires.py#L135

            for unit in relation.units:

Would no longer include the unit that you just removed.

And thus your later iteration here:
https://github.com/omnivector-solutions/slurm-charms/blob/238f9eb53aacd7778824c898ea551dc826e31df1/charm-slurmctld/src/slurmd_requires.py#L119-L121

        for node in self._slurmd_node_data:
            part_dict[node['partition_name']].setdefault('hosts', [])
            part_dict[node['partition_name']]['hosts'].append(node['hostname'])

Would also not include that node, and thus your hosts.append would not include that hostname.

I think it is just a case that you need to include relation-departed as the ‘way you find out that a unit is removed’ rather than assuming it would be a 'relation-changed ’ that now has no data available.

1 Like

This though …

juju deploy -n 3 percona-cluster
Model  Controller           Cloud/Region         Version  SLA          Timestamp
rats   localhost-localhost  localhost/localhost  2.8.1    unsupported  13:23:36Z

App              Version  Status  Scale  Charm            Store       Rev  OS      Notes
percona-cluster  5.7.20   active      3  percona-cluster  jujucharms  290  ubuntu

Unit                Workload  Agent  Machine  Public address  Ports     Message
percona-cluster/0*  active    idle   0        10.232.132.66   3306/tcp  Unit is ready
percona-cluster/1   active    idle   1        10.232.132.113  3306/tcp  Unit is ready
percona-cluster/2   active    idle   2        10.232.132.210  3306/tcp  Unit is ready

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.232.132.66   juju-850d2b-0  bionic      Running
1        started  10.232.132.113  juju-850d2b-1  bionic      Running
2        started  10.232.132.210  juju-850d2b-2  bionic      Running
juju remove-unit percona-cluster/1 percona-cluster/2
Model  Controller           Cloud/Region         Version  SLA          Timestamp
rats   localhost-localhost  localhost/localhost  2.8.1    unsupported  13:33:55Z

App              Version  Status   Scale  Charm            Store       Rev  OS      Notes
percona-cluster  5.7.20   blocked      1  percona-cluster  jujucharms  290  ubuntu

Unit                Workload  Agent  Machine  Public address  Ports     Message
percona-cluster/0*  blocked   idle   0        10.232.132.66   3306/tcp  MySQL is down. Sequence Number: 2. Safe To Bootstrap: 0

Machine  State    DNS            Inst id        Series  AZ  Message
0        started  10.232.132.66  juju-850d2b-0  bionic      Running
$ juju run --unit percona-cluster/0 'relation-get -r peer:0 - percona-cluster/0'
bootstrap-uuid: 23632a2d-dabd-11ea-ac52-4a122b54ef5e
cluster-address: 10.232.132.66
egress-subnets: 10.232.132.66/32
ingress-address: 10.232.132.66
private-address: 10.232.132.66
ready: "True"
bdx@rodent-01:~
$ juju run --unit percona-cluster/0 'relation-get -r peer:0 - percona-cluster/1'
bootstrap-uuid: 23632a2d-dabd-11ea-ac52-4a122b54ef5e
cluster-address: 10.232.132.113
egress-subnets: 10.232.132.113/32
ingress-address: 10.232.132.113
private-address: 10.232.132.113
ready: "True"
bdx@rodent-01:~
$ juju run --unit percona-cluster/0 'relation-get -r peer:0 - percona-cluster/2'
bootstrap-uuid: 23632a2d-dabd-11ea-ac52-4a122b54ef5e
cluster-address: 10.232.132.210
egress-subnets: 10.232.132.210/32
ingress-address: 10.232.132.210
private-address: 10.232.132.210

but the list of units (eg relation-list) is updated.

Again, I perfectly agree that returning the data about removed units is a bug. That said, ‘relation-list’ is how you should be finding out who your peers are, not blindly iterating the identities that you think are there. And that list is updated by the time you get to ‘relation-departed’. And it should be what drives for unit in relation.units:.

2 Likes

Got it. I was under the impression that I was accessing relation-list by iterating over relation.units, but I can see now that isn’t the case. I’ll try the alternate method you have suggested anyway.

Thanks @jameinel

this makes total sense :+1: