Botched release upgrade. Can I recover?

I have been experimenting with release upgrades, specifically bionic->focal, which I know is unsupported and which often does not work so well :slight_smile: Several times I have come across the issue where a command such as

juju upgrade-series 0 complete

never actually completes. In this case, machine 0 had a neutron-gateway on it, but it seems that the ntp charm was responsible for this failure:

:~$ juju ssh 0 tail /var/log/juju/unit-ntp-1.log
2020-07-07 16:35:53 DEBUG start Could not find platform independent libraries <prefix>
2020-07-07 16:35:53 DEBUG start Could not find platform dependent libraries <exec_prefix>
2020-07-07 16:35:53 DEBUG start Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
2020-07-07 16:35:53 DEBUG start Fatal Python error: Py_Initialize: Unable to get the locale encoding
2020-07-07 16:35:53 DEBUG start ModuleNotFoundError: No module named 'encodings'
2020-07-07 16:35:53 DEBUG start
2020-07-07 16:35:53 DEBUG start Current thread 0x00007fb8cb8fb740 (most recent call first):
2020-07-07 16:35:54 ERROR juju.worker.uniter.operation runhook.go:136 hook "start" (via explicit, bespoke hook script) failed: signal: aborted (core dumped)

So I get that it doesn’t work yet; I don’t really care about that. The question is: can I recover from this somehow? As it stands, the machine is still locked for the release upgrade and I can’t seem to do anything with it.

Any advice please?

Sorry to hear you are having trouble.

I came across this yesterday with another unit and am working on fixing it.

To unlock the machine, you can manually connect to the MongoDB (see here) and remove the machine’s series-upgrade lock from the machineUpgradeSeriesLocks collection. As to the charm issue itself, I am not yet sure as to a resolution.
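As a rough sketch of what that looks like (the connection details below are typical for a Juju 2.x controller, and the lock document format is an assumption — inspect the collection before removing anything):

```shell
# On the controller machine (sketch only; this is not a supported procedure).
juju ssh -m controller 0

# Connect to Juju's MongoDB using the machine agent's credentials from
# agent.conf (port 37017 and the "juju" database are the usual defaults):
mongo --ssl --sslAllowInvalidCertificates \
  --authenticationDatabase admin \
  -u machine-0 \
  -p "$(sudo grep statepassword /var/lib/juju/agents/machine-0/agent.conf | cut -d' ' -f2)" \
  localhost:37017/juju

# Then, in the mongo shell, inspect the locks first and remove only the
# one for the stuck machine (the field names here are an assumption):
# > db.machineUpgradeSeriesLocks.find().pretty()
# > db.machineUpgradeSeriesLocks.remove({"machine-id": "0"})
```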

Thank you this is very helpful.

I noticed the same problem with the placement charm, but I just rebuilt that unit. This unit being on ‘tin’ would have made that a bit harder, so you have saved me the hassle!

As to the charm, my guess is that the unit has a virtualenv that does not survive the release upgrade?

One thing I’ve found useful for determining why something is stuck in the complete phase is the juju_machine_lock command: it shows which charm/hook is holding the machine lock, so you can then track down the issue it is causing. It seems you were already able to narrow this down to the ntp charm failing to run its post-upgrade hook successfully.
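For reference, juju_machine_lock is one of the agent introspection helpers available on Juju machines themselves (it is loaded into the login shell on the machine, so the simplest way to use it is interactively):

```shell
# SSH to the stuck machine...
juju ssh 0

# ...then, on the machine, ask who currently holds the execution lock.
# The output lists the holder (e.g. a unit running a hook) and any waiters.
juju_machine_lock
```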

Can you comment on which revision of the ntp charm you are running? I’m wondering whether that revision supports focal. You should be running cs:ntp-38 or cs:ntp-39 (and all other subordinate charms you’re running should likewise be checked for series support) before performing the focal upgrade on your principal unit machines.
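In case it helps, the deployed revision is visible in juju status; for example:

```shell
# The "Rev" column in the tabular output shows the deployed charm revision
juju status ntp

# or pull it out of the YAML status
juju status ntp --format=yaml | grep 'charm:'
```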

Also, the workaround in this bug may help you to get through your woes.

You essentially need to delete the wheelhouse/.bootstrapped marker and the .venv directory in the affected charms, and then force a re-bootstrap manually by calling the Python methods that layer-basic uses during the install hook.
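As a sketch of that workaround (the paths follow the typical layer-basic/reactive charm layout, and bootstrap_charm_deps is the layer-basic helper that the install hook normally runs — verify both against your charm version before relying on this):

```shell
# On the affected machine; unit-ntp-1 is the unit from the log above.
juju ssh 0

# The reactive charm's venv and bootstrap marker live under the agent dir
# (verify these paths on your machine):
CHARM_DIR=/var/lib/juju/agents/unit-ntp-1/charm
sudo rm -rf "$CHARM_DIR/../.venv" "$CHARM_DIR/wheelhouse/.bootstrapped"

# Force a re-bootstrap the same way layer-basic's install hook does.
# PYTHONPATH is needed so the charm's bundled layer code is importable.
cd "$CHARM_DIR"
sudo env JUJU_CHARM_DIR="$CHARM_DIR" PYTHONPATH="$CHARM_DIR/lib" python3 -c \
  'from charms.layer.basic import bootstrap_charm_deps; bootstrap_charm_deps()'
```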

If your bug was exhibited while using cs:ntp-39, please file a bug with the product here:

https://bugs.launchpad.net/ntp-charm/

There will be further iterations on making series upgrades more robust, but this change has landed on the tip of 2.8 and in edge:

https://github.com/juju/juju/pull/11812

It better exposes the upgrade workflow to the operator via the machine’s status.

Sorry Drew, I got distracted here. Yes, it was with ntp-39…

Thanks so much for your insight.

I confirmed this is still an issue in ntp-41 and filed Bug #1891220, “upgrade-series complete fails bionic->focal”, against the NTP charm.

Thanks for investigating. Sorry that you’re affected by this.