Help recover a broken juju controller upgrade

I upgraded my juju controller to 2.8.0 and completely lost the controller. I need to recover these models and I need some help to do that. I’m perfectly fine with the solution being to remove this controller and add it again. My only goal at this point is recovering my models.

During the upgrade to 2.8.0, something went horribly wrong and my configuration file didn’t end up with apiserver addresses inside and things seemed to fall over. In my attempt to fix this, I realized that I know very little about what is going on with the controller and how it works.

For example, I thought maybe upgrading to 2.8.1 could help, but I looked for a jujud snap and apt package and didn’t find anything. I’m not sure how the controller runs and what is supposed to run outside of the mongod process, which is running.

The log file seems unhelpful for me to get hints for google, but that could be my ignorance again. I see this on startup now:

2020-08-19 02:49:23 INFO juju.cmd supercommand.go:91 running jujud [2.8.0 0 d816abe62fbf6787974e5c4e140818ca08586e44 gc go1.14.4]
2020-08-19 02:49:23 DEBUG juju.cmd supercommand.go:92   args: []string{"/var/lib/juju/tools/machine-0/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "0", "--debug"}
2020-08-19 02:49:23 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 2
2020-08-19 02:49:23 DEBUG juju.agent agent.go:575 read agent config, format "2.0"
2020-08-19 02:49:23 INFO juju.cmd.jujud agent.go:138 setting logging config to "<root>=WARNING;unit=DEBUG"
2020-08-19 02:49:24 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [21a522] "machine-0" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-08-19 02:49:27 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [21a522] "machine-0" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-08-19 02:49:31 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [21a522] "machine-0" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-08-19 02:49:35 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [21a522] "machine-0" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-08-19 02:49:40 ERROR juju.worker.apicaller connect.go:204 Failed to connect to controller: invalid entity name or password (unauthorized access)

It seems like some sort of auth issue, but is it auth with mongo? I see some keys in the agent config, but I’m not sure what they should match.

Is there a troubleshooting document that I haven’t found yet that explains the controller and what things should look like and how to validate startup of the various processes?

As an aside, the jujud comes from simplestreams, not a snap or apt.

Can you explain what you did to try and fix it? Did you edit /var/lib/juju/agents/machine-0/agent.conf?
It seems like it because 127.0.0.1 is not how the controller address is recorded. It’s more like:

apiaddresses:
- 10.115.246.59:17070
statepassword: NEIS4fa9yChAEcwJt2i6fAS6
apipassword: NEIS4fa9yChAEcwJt2i6fAS6
oldpassword: e1b360403d29b83b0b483f69e845c180

Plus there should also be a CA cert in that file.

You have taken a backup with juju create-backup, right? :slight_smile: We can restore from that easily enough.
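For reference, the backup and restore flow in the 2.x CLI looks roughly like this (a sketch from memory; exact flags can vary by version, so check juju help create-backup on your client first):

```shell
# On a healthy controller, ideally before any upgrade:
juju create-backup -m controller

# Later, restore from the downloaded archive
# (the filename here is illustrative):
juju restore-backup --file juju-backup-20200819.tar.gz
```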

If not, we’d need to try and get the agent.conf file back up to scratch, assuming it is currently wrong.
If the controller password has also been lost, we’d need to consider some ugly database surgery; as long as the db connection password (statepassword) in agent.conf is intact, we can avoid that.

Thank you for the help, @wallyworld. I originally documented some of the steps I took in this thread, but the gist is that I had issues and a missing apiaddresses entry in my agent.conf. I also seem to be missing the apipassword entry. I just set it to the same as the statepassword and it seems the controller is up, but none of the units are connected.

Looking at a random unit, I see logs with

2020-08-01 00:16:39 INFO juju.cmd supercommand.go:91 running jujud [2.8.0 0 d816abe62fbf6787974e5c4e140818ca08586e44 gc go1.14.4]
2020-08-01 00:16:39 DEBUG juju.cmd supercommand.go:92   args: []string{"/var/lib/juju/tools/machine-17/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "17", "--debug"}
2020-08-01 00:16:39 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 24
2020-08-01 00:16:39 DEBUG juju.agent agent.go:575 read agent config, format "2.0"
2020-08-01 00:16:39 INFO juju.cmd.jujud agent.go:138 setting logging config to "<root>=WARNING;unit=DEBUG"
2020-08-01 00:16:39 ERROR juju.worker.apiconfigwatcher manifold.go:132 retrieving API addresses: No apidetails in config
2020-08-01 00:16:39 ERROR juju.worker.apiconfigwatcher manifold.go:132 retrieving API addresses: No apidetails in config
2020-08-01 00:16:39 ERROR juju.worker.apiconfigwatcher manifold.go:132 retrieving API addresses: No apidetails in config
2020-08-19 04:13:41 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [6bad3a] "machine-17" cannot open api: API info not available

Sure enough, this agent.conf has no apiaddresses or apipassword. I added those, but that doesn’t seem to have done the trick.

2020-08-19 04:44:03 INFO juju.cmd supercommand.go:91 running jujud [2.8.0 0 d816abe62fbf6787974e5c4e140818ca08586e44 gc go1.14.4]
2020-08-19 04:44:03 DEBUG juju.cmd supercommand.go:92   args: []string{"/var/lib/juju/tools/machine-17/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "17", "--debug"}
2020-08-19 04:44:03 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 24
2020-08-19 04:44:03 DEBUG juju.agent agent.go:575 read agent config, format "2.0"
2020-08-19 04:44:03 INFO juju.cmd.jujud agent.go:138 setting logging config to "<root>=WARNING;unit=DEBUG"
2020-08-19 04:44:03 ERROR juju.worker.apicaller connect.go:204 Failed to connect to controller: invalid entity name or password (unauthorized access)

So at this point I’m able to run juju status again! Progress!

Each unit agent has its own agent.conf file with its own password.
The api address value will be the same across all units, but the password will be different for each.

Are you able to add back the apipassword for each unit agent conf file?

The same applies for each machine agent.conf file too. Each agent is identified by the tag in the file (e.g. unit-foo-0 or machine-6) and the apipassword. These need to match what Juju has stored in mongo for that entity (the stored password will be a hash).
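If you want to sanity-check what’s stored, you can inspect the hashes in mongo. A sketch of a mongo shell session follows; the machines query uses the machineid field (the same one used in the update commands later in this thread), while the units field name is my assumption about the collection layout:

```
use juju
// stored hash for the machine-6 agent
db.machines.find({"machineid": "6"}, {"passwordhash": 1})
// stored hash for the unit-foo-0 agent (field name assumed)
db.units.find({"name": "foo/0"}, {"passwordhash": 1})
```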

:frowning_face: This isn’t filling me with warm fuzzies, but let us see what we can figure out. So my agent.conf files are somehow missing entries. Here is an example:

$ sudo cat agent.conf
# format 2.0
tag: machine-31
datadir: /var/lib/juju
transient-datadir: /var/run/juju
logdir: /var/log/juju
metricsspooldir: /var/lib/juju/metricspool
nonce: machine-0:f2b7332f-5d03-4a61-84f9-966dd8ba37f0
jobs:
- JobHostUnits
upgradedToVersion: 2.8.0
cacert: |
  -----BEGIN CERTIFICATE-----
  <redacted>
  -----END CERTIFICATE-----
controller: controller-ee84341b-ee88-4788-80a4-a8319fa3d904
model: model-6bad3a62-8bdc-4f6d-83fc-48aff1ebd0e3
oldpassword: <redacted>
loggingconfig: <root>=WARNING;unit=DEBUG
values:
  AGENT_SERVICE_NAME: jujud-machine-31
  CONTAINER_TYPE: ""
  PROVIDER_TYPE: maas
mongoversion: "0.0"

How in the world could I end up losing the apipassword and the apiaddresses from this config? If I have a backup (I believe I do), could I find and extract that information from it and inject it into the configuration files on the units? Is the backup a tgz or anything easily readable?

Edit:
Seems the backup is just the controller information. Is there a way I can generate these unit passwords again? Could I do something like add the machine again in order to get a password generated? Could I manually generate one and then hash it myself and write the hash into the database?

Off hand, I have no idea how the api server address info and apipassword went missing. A recent 2.8 PR added a change to not write out empty addresses, since the issue has never been observed “in the lab”. I don’t think that affects empty apipasswords though (this is the first we’ve heard of it). The smoking gun here has been very elusive to track down.

The password in mongo is stored as a sha512 hash:

func AgentPasswordHash(password string) string {
	sum := sha512.New()
	sum.Write([]byte(password))
	h := sum.Sum(nil)
	return base64.StdEncoding.EncodeToString(h[:18])
}

so you could generate a new hash of a new password that you put into agent.conf and update mongo (the “passwordhash” field on the relevant doc in the “units” or “machines” collection).
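If you want to check a candidate password against the stored hash by hand, the function runs fine standalone. A self-contained sketch (the password literal is just the example value quoted earlier in this thread):

```go
package main

import (
	"crypto/sha512"
	"encoding/base64"
	"fmt"
)

// AgentPasswordHash mirrors the juju helper quoted above: sha512 the
// password, then base64-encode the first 18 bytes of the digest, which
// yields a 24-character string.
func AgentPasswordHash(password string) string {
	sum := sha512.New()
	sum.Write([]byte(password))
	h := sum.Sum(nil)
	return base64.StdEncoding.EncodeToString(h[:18])
}

func main() {
	// Compare this output to the passwordhash field in mongo.
	fmt.Println(AgentPasswordHash("NEIS4fa9yChAEcwJt2i6fAS6"))
}
```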

Thank you so much, @wallyworld. You helped me recover this cluster! For anyone else who may stumble upon this thread with much the same issue. This is what I did.

  1. Generate new passwords and hashes with the following Go program (I just ran it in my browser at https://play.golang.org/).
package main

import (
	"crypto/sha512"
	"encoding/base64"
	"fmt"
	"math/rand"
	"time"
)

func AgentPasswordHash(password string) string {
	sum := sha512.New()
	sum.Write([]byte(password))
	h := sum.Sum(nil)
	return base64.StdEncoding.EncodeToString(h[:18])
}

var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

func randSeq(n int) string {
	b := make([]rune, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

func main() {
	rand.Seed(time.Now().UnixNano())
	units := [15]string{"17", "17/lxd/0", "17/lxd/1", "18", "18/lxd/0", "22", "22/lxd/0", "22/lxd/1", "22/lxd/2", "22/lxd/3", "22/lxd/5", "23", "23/lxd/0", "31", "33"}
	var pws [15]string
	for i := range units {
		pws[i] = randSeq(24)
	}
	
	fmt.Printf("db update commands:\n")
	for i, s := range units {
		fmt.Printf("db.machines.update({'machineid':'%s'},{$set:{'passwordhash':'%s'}})\n", s, AgentPasswordHash(pws[i]))
	}

	fmt.Printf("host update commands:\n")
	for i, s := range units {
		fmt.Printf("machine %s:\n grep -q apipassword /var/lib/juju/agents/machine-*/agent.conf && sed -i 's/^apipassword:.*/apipassword: %s/' /var/lib/juju/agents/machine-*/agent.conf || echo 'apipassword: %s' >> /var/lib/juju/agents/machine-*/agent.conf\n", s, pws[i], pws[i])
	}
}

Set the array values of machine ids to what you need. This generates new passwords and dumps out the db commands to update the hashes. It also dumps out a command to update the machine apipassword.

I had to get into mongo on the controller; I found the following script to do that:

#!/bin/bash

machine="${1:-0}"
model="${2:-controller}"
juju=$(command -v juju)

read -d '' -r cmds <<'EOF'
conf=/var/lib/juju/agents/machine-*/agent.conf
user=$(sudo awk '/tag/ {print $2}' $conf)
password=$(sudo awk '/statepassword/ {print $2}' $conf)
client=$(command -v mongo)
"$client" 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "$user" --password "$password"
EOF

"$juju" ssh -m "$model" "$machine" "$cmds"

Once I ran that script, I was able to do the following:

use juju
<paste db script output>

Then I needed to update the individual unit files. I found that juju run wouldn’t work, presumably because it asks the agent to run something for it rather than sshing into the machine. So I had to manually juju ssh into each machine and copy/paste the grep line from the script. I just sudo su'd to root so I wouldn’t have to worry about getting sudo right in the one-liner. That line ensures there is an apipassword field and that it is updated. I then had to edit each of the agent files to check for passwords and apiaddresses (sudo vi /var/lib/juju/agents/*/agent.conf). I found that the machines were usually missing both the passwords and the apiaddresses, but the script had added the passwords back, so I just had to copy the apiaddresses portion into each one:

apiaddresses:
- 10.0.4.194:17070

This could have been scripted, but I needed to check each one anyway. I found that about 70% of the unit agent.conf files were missing the apiaddresses, but all of them still had their passwords. This was great because I didn’t have to generate anything to smash passwords into the db for the units.
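For anyone who would rather script that step, here is a minimal sketch of the fix-up. It runs against a throwaway file rather than a live agent.conf; the address is the controller IP from my example above, so adjust both the path and the address for your setup:

```shell
# Append an apiaddresses block to an agent.conf that lacks one.
ensure_apiaddresses() {
  local conf="$1" addr="$2"
  if ! grep -q '^apiaddresses:' "$conf"; then
    printf 'apiaddresses:\n- %s\n' "$addr" >> "$conf"
  fi
}

# Demonstration against a throwaway file, not a real agent.conf.
conf=$(mktemp)
printf 'tag: machine-31\ndatadir: /var/lib/juju\n' > "$conf"
ensure_apiaddresses "$conf" "10.0.4.194:17070"
ensure_apiaddresses "$conf" "10.0.4.194:17070"  # second call is a no-op
grep -c '^apiaddresses:' "$conf"                # prints 1
rm -f "$conf"
```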

Then I had to restart all the juju agents.

sudo systemctl restart jujud-machine-17.service
sudo systemctl restart jujud-unit-*.service

At this point the agents are all connected and juju status is happy! This was quite an ordeal, but I feel like it’s less voodoo and more juju now in my head. Now I will go back to administering the cluster and upgrade my k8s. Thank you again, @wallyworld. You really saved me a lot of pain. I have a ceph cluster inside here now and the thought of losing it or trying to find a way to back it up so I could nuke it and re-commission these machines in MaaS wasn’t appealing.


So glad you got it all recovered. Thanks for grabbing my hints and taking the time to write up in detail what you did. I for one really appreciate the effort and time taken to help others.
