Help recover a broken juju controller upgrade

I upgraded my juju controller to 2.8.0 and completely lost the controller. I need to recover these models and I need some help to do that. I’m perfectly fine with the solution being to remove this controller and add it again. My only goal at this point is recovering my models.

During the upgrade to 2.8.0, something went horribly wrong: my agent configuration file ended up without any apiserver addresses in it, and things fell over. In my attempt to fix this, I realized that I know very little about what is going on with the controller and how it works.

For example, I thought maybe upgrading to 2.8.1 could help, but I looked for a jujud snap and an apt package and didn’t find either. I’m not sure how the controller runs or what is supposed to run besides the mongod process, which is running.

The log file hasn’t given me anything useful to google, but that could be my ignorance again. I see this on startup now:

2020-08-19 02:49:23 INFO juju.cmd supercommand.go:91 running jujud [2.8.0 0 d816abe62fbf6787974e5c4e140818ca08586e44 gc go1.14.4]
2020-08-19 02:49:23 DEBUG juju.cmd supercommand.go:92   args: []string{"/var/lib/juju/tools/machine-0/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "0", "--debug"}
2020-08-19 02:49:23 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 2
2020-08-19 02:49:23 DEBUG juju.agent agent.go:575 read agent config, format "2.0"
2020-08-19 02:49:23 INFO juju.cmd.jujud agent.go:138 setting logging config to "<root>=WARNING;unit=DEBUG"
2020-08-19 02:49:24 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [21a522] "machine-0" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-08-19 02:49:27 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [21a522] "machine-0" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-08-19 02:49:31 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [21a522] "machine-0" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-08-19 02:49:35 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [21a522] "machine-0" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-08-19 02:49:40 ERROR juju.worker.apicaller connect.go:204 Failed to connect to controller: invalid entity name or password (unauthorized access)

It seems like some sort of auth issue, but is it auth with mongo? I see some keys in the agent config, but I’m not sure what they should match.

Is there a troubleshooting document that I haven’t found yet that explains the controller, what things should look like, and how to validate that the various processes start up correctly?

As an aside, the jujud comes from simplestreams, not a snap or apt.

Can you explain what you did to try and fix it? Did you edit /var/lib/juju/agents/machine-0/agent.conf?
It seems like it because 127.0.0.1 is not how the controller address is recorded. It’s more like:

apiaddresses:
- 10.115.246.59:17070
statepassword: NEIS4fa9yChAEcwJt2i6fAS6
apipassword: NEIS4fa9yChAEcwJt2i6fAS6
oldpassword: e1b360403d29b83b0b483f69e845c180

Plus there should also be a CA cert in that file.
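
A quick way to see which of those keys are still present on your controller machine (a rough check; adjust the machine number to match your controller):

sudo grep -E '^(apiaddresses|apipassword|statepassword|oldpassword|cacert):' /var/lib/juju/agents/machine-0/agent.conf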

You have taken a backup with juju backup right? :slight_smile: We can restore from that easily enough.

If not, we’d need to try and get the agent.conf file up to scratch, assuming it is currently wrong.
If the controller password has been lost, we’d need to consider some ugly database surgery; that only works if you still have the db connection password (statepassword) in agent.conf.
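
For reference, the backup commands in the 2.x CLI look roughly like this (a sketch; check the built-in help for the exact restore options on your version):

juju create-backup           # taken while the controller is healthy
juju restore-backup --help   # lists the restore options for your exact version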


Thank you for the help, @wallyworld. I originally documented some of the steps I took in this thread, but the gist is that the apiaddresses entry went missing from my agent.conf. The apipassword entry also seems to be missing; I set it to the same value as the statepassword, and now the controller is up, but none of the units are connected.

Looking at a random unit, I see logs with

2020-08-01 00:16:39 INFO juju.cmd supercommand.go:91 running jujud [2.8.0 0 d816abe62fbf6787974e5c4e140818ca08586e44 gc go1.14.4]
2020-08-01 00:16:39 DEBUG juju.cmd supercommand.go:92   args: []string{"/var/lib/juju/tools/machine-17/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "17", "--debug"}
2020-08-01 00:16:39 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 24
2020-08-01 00:16:39 DEBUG juju.agent agent.go:575 read agent config, format "2.0"
2020-08-01 00:16:39 INFO juju.cmd.jujud agent.go:138 setting logging config to "<root>=WARNING;unit=DEBUG"
2020-08-01 00:16:39 ERROR juju.worker.apiconfigwatcher manifold.go:132 retrieving API addresses: No apidetails in config
2020-08-01 00:16:39 ERROR juju.worker.apiconfigwatcher manifold.go:132 retrieving API addresses: No apidetails in config
2020-08-01 00:16:39 ERROR juju.worker.apiconfigwatcher manifold.go:132 retrieving API addresses: No apidetails in config
2020-08-19 04:13:41 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [6bad3a] "machine-17" cannot open api: API info not available

Sure enough, this agent.conf has no apiserver or apipassword. I added those, but that doesn’t seem to have done the trick.

2020-08-19 04:44:03 INFO juju.cmd supercommand.go:91 running jujud [2.8.0 0 d816abe62fbf6787974e5c4e140818ca08586e44 gc go1.14.4]
2020-08-19 04:44:03 DEBUG juju.cmd supercommand.go:92   args: []string{"/var/lib/juju/tools/machine-17/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "17", "--debug"}
2020-08-19 04:44:03 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 24
2020-08-19 04:44:03 DEBUG juju.agent agent.go:575 read agent config, format "2.0"
2020-08-19 04:44:03 INFO juju.cmd.jujud agent.go:138 setting logging config to "<root>=WARNING;unit=DEBUG"
2020-08-19 04:44:03 ERROR juju.worker.apicaller connect.go:204 Failed to connect to controller: invalid entity name or password (unauthorized access)

So at this point I’m able to run juju status again! Progress!

Each unit agent has its own agent.conf file with its own password.
The api address value will be the same across all units, but the password will be different for each.

Are you able to add back the apipassword for each unit agent conf file?

The same applies to each machine agent.conf file too. Each agent is identified by the tag in the file (e.g. unit-foo-0 or machine-6) and the apipassword. These need to match what Juju has stored in mongo for that entity (the stored password will be a hash).
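
A quick way to see which agents on a given machine still have an apipassword at all (a rough sketch; run it on each machine):

for conf in /var/lib/juju/agents/*/agent.conf; do
    if sudo grep -q '^apipassword:' "$conf"; then
        echo "$conf: apipassword present"
    else
        echo "$conf: apipassword MISSING"
    fi
done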

:frowning_face: This isn’t filling me with warm fuzzies, but let us see what we can figure out. So my agent.conf files are somehow missing entries. Here is an example:

$ sudo cat agent.conf
# format 2.0
tag: machine-31
datadir: /var/lib/juju
transient-datadir: /var/run/juju
logdir: /var/log/juju
metricsspooldir: /var/lib/juju/metricspool
nonce: machine-0:f2b7332f-5d03-4a61-84f9-966dd8ba37f0
jobs:
- JobHostUnits
upgradedToVersion: 2.8.0
cacert: |
  -----BEGIN CERTIFICATE-----
  <redacted>
  -----END CERTIFICATE-----
controller: controller-ee84341b-ee88-4788-80a4-a8319fa3d904
model: model-6bad3a62-8bdc-4f6d-83fc-48aff1ebd0e3
oldpassword: <redacted>
loggingconfig: <root>=WARNING;unit=DEBUG
values:
  AGENT_SERVICE_NAME: jujud-machine-31
  CONTAINER_TYPE: ""
  PROVIDER_TYPE: maas
mongoversion: "0.0"

How in the world could I end up losing the apipassword and the apiservers from this config? If I have a backup (I believe I do), could I find and extract that information from it and inject it into the configuration files on the units? Is the backup a tgz or anything easily readable?

Edit:
Seems the backup is just the controller information. Is there a way I can generate these unit passwords again? Could I do something like add the machine again in order to get a password generated? Could I manually generate one and then hash it myself and write the hash into the database?

Offhand, I have no idea how the API server address info and apipassword went missing. A recent 2.8 PR added a change to not write out empty addresses, since the issue has never been reproduced “in the lab”. I don’t think that change covers empty apipasswords though (this is the first we’ve heard of that). The smoking gun here has been very elusive to track down.

The password in mongo is stored as a SHA-512 hash:

func AgentPasswordHash(password string) string {
	sum := sha512.New()
	sum.Write([]byte(password))
	h := sum.Sum(nil)
	return base64.StdEncoding.EncodeToString(h[:18])
}

so you could generate a new hash of a new password that you put into agent.conf and update mongo (the “passwordhash” field on the relevant doc in the “units” or “machines” collection).
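
If you just want to check whether an existing apipassword matches what is stored, the same hash can be reproduced from the shell; a rough sketch assuming openssl and coreutils are available (the sample password is the one from the agent.conf example earlier in this thread):

# SHA-512 the password, keep the first 18 bytes, base64-encode: same scheme as AgentPasswordHash above.
printf '%s' 'NEIS4fa9yChAEcwJt2i6fAS6' | openssl dgst -sha512 -binary | head -c 18 | base64

Compare the output with the "passwordhash" field on the machine or unit doc.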

Thank you so much, @wallyworld. You helped me recover this cluster! For anyone else who may stumble upon this thread with much the same issue, this is what I did.

1. Generate new passwords and hashes with the following Go program (I just ran it in my browser at https://play.golang.org/):
package main

import (
	"crypto/sha512"
	"encoding/base64"
	"fmt"
	"math/rand"
	"time"
)

// AgentPasswordHash mirrors Juju's hashing scheme: SHA-512, first 18 bytes, base64-encoded.
func AgentPasswordHash(password string) string {
	sum := sha512.New()
	sum.Write([]byte(password))
	h := sum.Sum(nil)
	return base64.StdEncoding.EncodeToString(h[:18])
}

var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

// randSeq returns a random alphanumeric string of length n.
func randSeq(n int) string {
	b := make([]rune, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

func main() {
	rand.Seed(time.Now().UnixNano())
	// Machine ids that need new passwords; adjust to match your deployment.
	units := [15]string{"17", "17/lxd/0", "17/lxd/1", "18", "18/lxd/0", "22", "22/lxd/0", "22/lxd/1", "22/lxd/2", "22/lxd/3", "22/lxd/5", "23", "23/lxd/0", "31", "33"}
	var pws [15]string
	for i := range units {
		pws[i] = randSeq(24)
	}

	// Mongo commands that store the new password hashes.
	fmt.Printf("db update commands:\n")
	for i, s := range units {
		fmt.Printf("db.machines.update({'machineid':'%s'},{$set:{'passwordhash':'%s'}})\n", s, AgentPasswordHash(pws[i]))
	}

	// Shell commands that write the matching plaintext apipassword into each machine's agent.conf.
	fmt.Printf("host update commands:\n")
	for i, s := range units {
		fmt.Printf("unit %s:\n grep -q apipassword /var/lib/juju/agents/machine-*/agent.conf && sed -i 's/^apipassword:.*/apipassword: %s/' /var/lib/juju/agents/machine-*/agent.conf || echo 'apipassword: %s' >> /var/lib/juju/agents/machine-*/agent.conf\n", s, pws[i], pws[i])
	}
}

Set the array of machine ids to whatever you need. The program generates new passwords and prints the db commands to update the hashes, plus a shell command per machine to update its apipassword.

I had to get into mongo on the controller; I found the following script to do that:

#!/bin/bash

machine="${1:-0}"
model="${2:-controller}"
juju=$(command -v juju)

read -d '' -r cmds <<'EOF'
conf=/var/lib/juju/agents/machine-*/agent.conf
user=$(sudo awk '/tag/ {print $2}' $conf)
password=$(sudo awk '/statepassword/ {print $2}' $conf)
client=$(command -v mongo)
"$client" 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "$user" --password "$password"
EOF

"$juju" ssh -m "$model" "$machine" "$cmds"

Once I ran that script, I was able to do the following:

use juju
<paste db script output>
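
The updates could also be fed to mongo non-interactively; here is a rough sketch run on the controller machine itself, reusing the connection details from the script above (db-updates.js is a hypothetical file holding the generated db.machines.update lines):

conf=/var/lib/juju/agents/machine-*/agent.conf
user=$(sudo awk '/tag/ {print $2}' $conf)
password=$(sudo awk '/statepassword/ {print $2}' $conf)
mongo 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "$user" --password "$password" < db-updates.js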

Then I needed to update the individual unit files. I found that juju run wouldn’t work, I assume because it asks the agent to run something on its behalf rather than sshing into the machine. So I had to manually juju ssh into each machine and copy/paste the grep line from the script output. I just sudo su'd to root so I wouldn’t have to worry about getting sudo right in the pasted command. That line ensures there is an apipassword field and that it is updated. I then had to check each of the agent files for passwords and apiservers with sudo vi /var/lib/juju/agents/*/agent.conf. I found that the machines were usually missing both the passwords and the apiservers, but the pasted line added the passwords back, so I just had to copy the apiserver portion into each one:

apiaddresses:
- 10.0.4.194:17070

This could have been scripted, but I needed to check each one anyway. About 70% of the unit agent.conf files were missing the apiservers, but they all still had their passwords, which was great because I didn’t have to generate anything or smash new passwords into the db for the units.
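
For anyone who does want to script the apiaddresses part, something along these lines should work (an untested sketch; the controller address and machine list are placeholders to adjust for your environment):

CONTROLLER="10.0.4.194:17070"   # from the controller's agent.conf
for m in 17 18 22 23 31 33; do
    juju ssh "$m" "sudo grep -q '^apiaddresses:' /var/lib/juju/agents/machine-*/agent.conf || printf 'apiaddresses:\n- $CONTROLLER\n' | sudo tee -a /var/lib/juju/agents/machine-*/agent.conf"
done

The unit files under /var/lib/juju/agents/unit-*/ can be handled the same way, though they are best checked one at a time as I did, since a single machine can host several unit agents.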

Then I had to restart all the juju agents.

sudo systemctl restart jujud-machine-17.service
sudo systemctl restart jujud-unit-*.service

At this point the agents are all connected and juju status is happy! This was quite an ordeal, but in my head it’s less voodoo and more juju now. Now I will go back to administering the cluster and upgrading my k8s. Thank you again, @wallyworld. You really saved me a lot of pain. I have a ceph cluster in here now, and the thought of losing it, or of trying to find a way to back it up so I could nuke everything and re-commission these machines in MAAS, wasn’t appealing.


So glad you got it all recovered. Thanks for grabbing my hints and taking the time to write up in detail what you did. I for one really appreciate the effort and time taken to help others.


As an update here, it has all fallen over again in the same way. My controller is unable to come online right now and I’m fighting that. The workers are all missing API server addresses again, as is the controller. I updated the controller’s agent.conf with the API information, but it was unable to connect because it had no password either. I couldn’t find one saved anywhere, so I did the mongo trick detailed above and smashed in a new password. That seems to have worked, but I’m now running into:

2020-10-16 05:43:04 ERROR juju.worker.dependency engine.go:671 "state" manifold worker returned unexpected error: cannot log in to admin database as "machine-0": unauthorized mongo access: server returned error on SASL authentication step: Authentication failed.

This is interesting because I’m completely able to log into mongo with the password from the agent.conf file. I’m not sure what to do at this point. I’m debating trying to back up my ceph cluster and just nuke this thing from orbit.

Hi Mike,
Sorry to hear it came up again. This PR

https://github.com/juju/juju/pull/11854

was landed in late July and would have been in the 2.8.2 release. It is meant to log a warning if the API addresses become empty and to skip writing them out.

I think you were originally running 2.8.0, so an upgrade to the latest 2.8 should help.

I was on 2.8.1 or 2.8.2, but I couldn’t recover it enough to upgrade. I gave up and cut my losses. That install had been with me for a few years, so it had some baggage, like stuck models with cross-model relations that wouldn’t release. I’ve nuked it and set it up again. So far it is going well. Sorry I couldn’t help track this down any further.

By the way, the juju codebase does have a helper script:
https://github.com/juju/juju/blob/develop/scripts/generate-password/main.go

(it is Go, so it needs to be compiled with ‘go build’ or ‘go install’)

but it lets you run something like:

$ ./generate-password dead-beef-12345 machine-0
oldpassword: m0zFH6iaofjnQcDoITEFClMp
db.machines.update({"_id": "dead-beef-12345:machine-0"}, {$set: {"passwordhash": "lCr7hf2n/TQAYCAIG221/cZ4"}})

The first line is the content to put in agent.conf, the second is the mongodb query to run to update the database with the right content.
You would need to get the model-uuid from something like agent.conf.

We keep that script in our source tree and make sure it compiles. I don’t think we have a CI test that verifies it can replace passwords, but it does use the same underlying functions as the rest of the code base.


Just a clarification on the example given: I ran the command like this:

./generate-password 2f161965-5889-xxxx-xxxx-1cd20bf10c6a unit-name/0

The example with “dead-beef” and “machine-0” was a little confusing because it made it appear as if the model ID needed the word “model-” in front of it (it does not; it is prefixed like that in the agent.conf file, but you only need the UUID itself) and as if the machine name was needed there instead of the unit name.

And you still have to edit the agent.conf file for the machine (not the unit) and remove the stopped-units line, otherwise it will be ignored during the restart even though you changed the password in the db and the unit agent.conf file.

I’ve had much success with the procedure above. I’ve had to do it for units as well as machines.

For unit:

./juju-generate-password 5e6635fe-ec57-45d8-87d6-c713064112d9 launchpad-mailman/0

For machines:

./juju-generate-password 4d00e10a-0c0c-430c-8fee-7ece72545b99 2
./juju-generate-password 4d00e10a-0c0c-430c-8fee-7ece72545b99 3

Note that the version I’m using doesn’t allow me to specify machine-X, only the bare machine ID.


Are there any compilation instructions for the helper script?

I’ve snap installed go, but I’m not sure how to build.

UPDATE:

I’ve managed to build it with:

sudo snap install go
cd juju/scripts/generate-password
go env -w GO111MODULE=auto
go mod init
go build

Oh, and the output is different depending on whether it’s a “machine index” or a “unit”:

Machine example

./generate-password c85701ca-70ff-405c-8b94-b7876019051b 0
oldpassword: Y28UuIGfHnmFe7yFXGPbyVB1
db.machines.update({"_id": "c85701ca-70ff-405c-8b94-b7876019051b:0"}, {$set: {"passwordhash": "KW8kh10S3xt38vdeOnlZNtZm"}})

Unit example

./generate-password c85701ca-70ff-405c-8b94-b7876019051b tiny-bash/0
oldpassword: O2TzsmkMXbLQh3rzjFKdk1PW
db.units.update({"_id": "c85701ca-70ff-405c-8b94-b7876019051b:tiny-bash/0"}, {$set: {"passwordhash": "18IHRWBejMJZNOXMfBfDljhj"}})

Got the same issue after moving containers to another pool (Juju version 2.9.42).

But it was quite a smooth procedure thanks to all the tips written here =)