How do you remove subordinate charms in error state?

I’m struggling to remove broken subordinates from one of my charms.

I deployed filebeat and related it to mysql-innodb-cluster, but two of the three units failed to install.
So I tried to remove the unit, but got: removing unit filebeat/18 failed: unit "filebeat/18" is a subordinate

I also tried: juju remove-relation mysql-innodb-cluster:juju-info filebeat:beats-host (also tried with --force, and after that with --no-wait) without success.

I even tried juju resolved filebeat/18, which just ends up back in the error state.

This is what I’m stuck with:

Unit                     Workload  Agent  Machine  Public address  Ports  Message
mysql-innodb-cluster/0   active    idle   0/lxd/2  192.168.51.17          Unit is ready: Mode: R/O
  filebeat/44            active    idle            192.168.51.17          Filebeat ready.
mysql-innodb-cluster/1*  active    idle   6        192.168.51.37          Unit is ready: Mode: R/W
  filebeat/18            error     idle            192.168.51.37          hook failed: "stop"
mysql-innodb-cluster/2   active    idle   7        192.168.51.36          Unit is ready: Mode: R/O
  filebeat/21            error     idle            192.168.51.36          hook failed: "stop"

I use a few tricks personally. It’s not always easy for me.

Trick 1. “Remove the application”

juju remove-application unitinerror
juju resolved unitinerror/0
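
Applied to the status above, that would look roughly like this (assuming you really do want filebeat gone; each erroring unit may need to be resolved before the removal completes):

juju remove-application filebeat
juju resolved filebeat/18
juju resolved filebeat/21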

Trick 2. ‘Patch unit locally and make the charm exit 0’

This trick is about editing the charm code directly on the host to force it to exit with code 0 (sys.exit(0)). That lets juju carry on with its business, which in turn makes it possible to truly resolve the issue and upgrade your charm code or deployment. Exiting with a zero code is what does the trick.

This is roughly how it goes:

juju ssh unitinerror/0
edit /var/lib/juju/.../charm/unitinerror/myerrorcode.py
exit 
juju resolved unitinerror/0
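
For illustration, the local patch can be as small as a sys.exit(0) near the top of whatever hook code keeps blowing up. The file name and function below are hypothetical; a real charm’s layout will differ:

# myerrorcode.py (hypothetical) -- the hook code that keeps failing on the unit
import sys

def stop():
    # Bail out before the broken cleanup logic runs, so the hook
    # exits 0 and juju can carry on.
    sys.exit(0)
    # ... the original (failing) cleanup code stays below, unreached ...

if __name__ == "__main__":
    stop()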

@erik-lonroth thanks for that.

I pulled the ‘cattle, not pets’ card… and threw the whole model in the trash and rebuilt it.

It’s a bit heavy-handed, but there were other issues elsewhere and the rebuild effort was a lot less than poking the broken pieces with a stick.

That said, having to resort to drastic actions like remove-application for stuck subordinates is pretty frightening when you think about higher-impact subordinates like ovn-chassis or ha-cluster.

Totally. Moving applications, units and machines around is scary indeed.

I don’t think there are any shortcuts with juju here, apart from learning how to work with your system as a whole and using juju to test and improve on that.

We have currently set up “reference models” or “staging models” where we can deploy a known working bundle and, from that, test out issues before we do anything to the production-grade systems. We’re gaining more and more confidence with this, and I think juju at least allows us to iterate on this in a very “software-development-kind-of” process.

I am still very interested in others’ experience running complex systems with juju over time. Questions such as:

  • Do you nuke your models when something goes wrong?
  • How do you handle processes of upgrading juju models?
  • How do you manage when something enters into error-states (like you described)?
  • How do you manage operating system upgrades?

Generally, a lot is to be learned from working with juju over time…

  • Do you nuke your models when something goes wrong?
    Yes. It’s easier to deal with things this way.
    There is discipline to be learned here, though. These days I treat everything as ephemeral. I fat-fingered removing a dead controller node last week, which left me in a pickle because I had workloads on models that were then inaccessible. Long story short, I bootstrapped a new set of controllers, and because my models are stored in Git as bundles I was able to deploy a replica environment, after which I backed up and restored the data into it. I then unregistered the old broken controller and blew away the straggling resources it left behind.
    I bet a lot of environments are just too big to do this kind of thing. But the ‘throw away when broken’ mentality has saved me a lot of time recently.

  • How do you handle processes of upgrading juju models?
    Tried this in the lab and it ended in tears. I’m personally too :poop: scared; I’ll leave the bandage on this wound for another day.

  • How do you manage when something enters into error-states (like you described)?
    Either try to coax it to where it needs to be: restarting units, restarting services.
    Other times I poke it with a stick by removing relations and adding them again, or simply scale up, wait until it’s stable, and then Shoot The Other Node In The Head (STONITH).

  • How do you manage operating system upgrades?
    LOL

As a note: if a failing hook is keeping your subordinate unit from being removed cleanly, you can often skip the failing relation-departed (or other failing) hook with “juju resolved --no-retry $unit”, e.g. “juju resolved --no-retry filebeat/21”.

Using this along with juju debug-log -i filebeat/21 can help you watch for errors and allow juju to move past them.
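
Put together, the flow for a stuck subordinate looks roughly like this (using the units from this thread; adjust names as needed):

# break the relation that keeps the subordinate attached
juju remove-relation mysql-innodb-cluster:juju-info filebeat:beats-host
# watch the unit's hook activity and errors
juju debug-log -i filebeat/21
# each time a hook fails, skip it rather than retry it
juju resolved --no-retry filebeat/21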

Unfortunately, removing units rarely removes the software itself, so before using this method make sure you actually want to re-deploy the application anew onto the same unit. Otherwise, you can rotate in newly deployed primary units while the relation is broken, which removes the subordinates along with the old units.
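
A rough sketch of that rotation, again with this thread’s units (assuming the cluster can tolerate losing a member while you do it; the unit numbers are just examples):

# bring up a fresh principal unit and let it settle
juju add-unit mysql-innodb-cluster
# then retire the old unit carrying the stuck subordinate,
# which takes the subordinate with it
juju remove-unit mysql-innodb-cluster/1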

Along with this method, you can also use juju debug-hooks, and once caught in the hook loop you can just “exit 0” to mark the hook as successful, though this is much the same as ‘juju resolved --no-retry’.
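
In command form, that looks something like:

# attach to the unit; juju opens a tmux session and pops up a window
# whenever a hook fires for it
juju debug-hooks filebeat/21
# from the shell of the failing hook's window, simply:
exit 0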


This is a new tool in my box for sure! I will try this.
