Controllers missing after upgrade-controller

I just upgraded an HA controller cluster from 2.8.1 to 2.8.2. The controllers never came back: 17070/tcp is closed, although jujud is running…

root@osjujud01:~# ps auxww | grep jujud
root      1101  0.0  0.0  21768  3428 ?        Ss   15:24   0:00 bash /etc/systemd/system/jujud-machine-1-exec-start.sh
root      1181  0.7  2.0 825632 82208 ?        Sl   15:24   0:52 /var/lib/juju/tools/machine-1/jujud machine --data-dir /var/lib/juju --machine-id 1 --debug
root      5935  0.0  0.0  14856  1036 pts/0    S+   17:14   0:00 grep --color=auto jujud
root@osjujud01:~# netstat -tlnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      698/systemd-resolve
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1081/sshd
tcp        0      0 0.0.0.0:37017           0.0.0.0:*               LISTEN      1108/mongod

The log has these messages:

2020-09-16 15:34:02 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [b5fb8e] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-09-16 15:35:05 ERROR juju.worker.upgradedatabase worker.go:305 timed out waiting for primary database upgrade
2020-09-16 15:35:51 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [b5fb8e] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-09-16 15:37:48 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [b5fb8e] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-09-16 15:39:44 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [b5fb8e] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-09-16 15:41:40 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [b5fb8e] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused

Any suggestions please? Many thanks in advance!

Can you look at the content of the /var/lib/juju/agents/machine-X/agent.conf files on the controllers and see what is listed under “apiaddresses”.
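For example, something along these lines should show it (the agent.conf path is as above; the machine id will differ per controller, and the -A3 is just to pull in a few lines of the list):

sudo grep -A3 apiaddresses /var/lib/juju/agents/machine-*/agent.conf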

Was there anything else in the logs from the time upgrade was run?

Does juju status work? Can you also run juju show-controller and paste the output?

I found the upgrade event in the logs. There is a SEGV when jujud restarts after the upgrade.

I raised bug 1895954.

How to recover from this? I wonder…

This is severe; let me triage the bug and I can work with you on recovery.


Hi @routergod.

Can you connect to your controller’s MongoDB (see this post) and run this query?

db.charms.find({meta: null}).pretty()

If you like, we can iterate on this faster in #juju on Freenode - I have the same username there.


Possible fix for getting the controller to progress

db.charms.update({meta: null}, { $set: {"meta": {}} }, false, true)

That isn’t the fix for the actual upgrade step (it should ignore these docs), but this prevents the nil pointer dereference that is causing the upgrade step to fail.

The upgrade failure happens when a past deploy or upgrade failed to connect to the charm store and left behind a “placeholder” record, which has a charm definition but no “meta” information.
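If you want to see which charm entries are affected before changing anything, a projection along these lines should work (I'm assuming the _id is enough to identify the charm):

db.charms.find({meta: null}, {"_id": 1}).pretty()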

To fix the records in the controllers and avoid the nil pointer panic, you can do:

SSH into each of the controller machines and run:

systemctl stop jujud-machine-*

Which should stop the controllers from trying to run the upgrade steps while we are updating the database.
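Before touching the database, it is worth confirming that the agents really are stopped on each controller, e.g. with:

systemctl status jujud-machine-*
ps auxww | grep jujud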

Get access to Mongo:

agent=$(cd /var/lib/juju/agents; echo machine-*)
pw=$(sudo grep statepassword /var/lib/juju/agents/${agent}/agent.conf | cut '-d ' -sf2)
mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju

If you are running an HA controller, you will want to determine which machine is the Mongo primary; it will have a prompt of:

juju:PRIMARY>

If the machine is not the primary, the prompt will be:

juju:SECONDARY>

You can also use

rs.status()

And look for the

"members": [

entry with a “stateStr” of “PRIMARY”, e.g.:

   "name" : "10.5.24.54:37017",
   "health" : 1,
   "state" : 1,
   "stateStr" : "PRIMARY",

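Alternatively, a quick check from the mongo prompt (rs.isMaster() is plain MongoDB, nothing Juju-specific):

rs.isMaster().ismaster

which returns true only on the primary.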
From there you can run:

db.charms.find({meta: null}).count()

And see how many records should be affected. You can exit mongo to run:

mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju --eval 'db.charms.find({}).pretty()' > all_records.txt

To get a complete list of all charm records in the ‘all_records.txt’ file. And

mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju --eval 'db.charms.find({"meta": null}).pretty()' > null_records.txt

To get just the records that have a ‘null’ meta field.

And then

mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju --eval 'db.charms.update({meta: null}, { $set: {"meta": {}} }, false, true)'

Which will update each record that has a nil meta to have an empty meta, avoiding the nil pointer dereference.
You should see a line like:

WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

Where the nModified matches the count() from earlier.
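If you want to double-check, re-running the earlier count should now return 0:

db.charms.find({meta: null}).count()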

Once you have run the database updates, you can then do:

systemctl start jujud-machine-*

On all of the controllers and it should do the upgrade and progress as normal.
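To watch the upgrade progress, tailing the machine log and checking that the API port comes back up is usually enough (assuming the standard log location; adjust the machine id):

tail -f /var/log/juju/machine-1.log
netstat -tlnp | grep 17070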

We have tested this workaround on 2 different controllers, and have also seen that it still allows users to issue a “juju upgrade-charm” for one of the previously failed upgrades.


tl;dr

If you just want the simplest way to get things working, you can run the bash script from here. It will log you into mongo and then run:

db.charms.update({meta: null}, { $set: {"meta": {}} }, false, true)

Which will update the database to fix the null attributes, allowing the controller upgrade to proceed.
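In case that link is not to hand, a rough one-shot equivalent assembled from the commands earlier in this thread (not the linked script itself; run it on the Mongo primary after stopping the agents) would be:

agent=$(cd /var/lib/juju/agents; echo machine-*)
pw=$(sudo grep statepassword /var/lib/juju/agents/${agent}/agent.conf | cut '-d ' -sf2)
mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju --eval 'db.charms.update({meta: null}, { $set: {"meta": {}} }, false, true)'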


@jameinel @manadart awesome work and thank you very much


I tried to follow @jameinel’s proposed workaround, but sadly it didn’t work for me.

ubuntu@juju-1:~$ netstat -tlnp
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:37017           0.0.0.0:*               LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      -
tcp6       0      0 :::37017                :::*                    LISTEN      -

I was upgrading from 2.8.10 to 2.9.0 and one of the three HA members ran out of disk space.

Can anyone help me to try to recover my controller?

Hi @angelvargas. Probably the best place to get live feedback is at https://chat.charmhub.io, our public Mattermost server. I don’t think you encountered the same issue that is being described in this thread, but we can certainly try to assist.