Juju controller mongodb journal fills disk

Title says it all really. I’ve only been using juju for a couple of weeks, running a bunch of tests and trying to diagnose errors, etc - not done any real work with it yet. The controller is installed on a physical MaaS server with a 240G disk, which is now completely full of juju’s mongodb journals.

Juju version: 2.7.6
Controller OS: Ubuntu 20.04
Installation method: juju bootstrap

What should I do to:

  1. Clean out the journal now to get juju started again? A lot of commands on the controller simply don’t work because the disk is full.
  2. Stop this from happening again?

OK I just completely rebuilt the system after this, wiped everything and bootstrapped a controller on fresh, empty hardware.

4 days on from that, with only one deployment of charmed kubernetes in a pretty standard way… and good news: the controller’s disk didn’t fill up.

However, the controller is now virtually unresponsive. Not much activity on the processor (using htop), but it struggles to do anything - no Juju GUI, no response to the CLI, can’t nslookup or apt install, but can ping. Even sudo shutdown -r now wouldn’t work. So, I restarted the server with the handy MaaS IPMI control. After the restart, apt install and nslookup now work. mongod service starts up and starts using a decent amount of CPU and a tiny bit of IO (disk) activity.

sudo systemctl list-unit-files results in:

UNIT FILE                              STATE           VENDOR PRESET 
juju-clean-shutdown.service            enabled         enabled      
juju-db.service                        enabled         enabled      
jujud-machine-0.service                enabled         enabled  

sudo service jujud-machine-0 status gives:

● jujud-machine-0.service - juju agent for machine-0
     Loaded: loaded (/etc/systemd/system/jujud-machine-0.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2020-05-14 10:31:55 BST; 1h 10min ago
   Main PID: 779 (bash)
      Tasks: 12 (limit: 9374)
     Memory: 102.8M

sudo service juju-db status gives:

● juju-db.service - juju state database
     Loaded: loaded (/etc/systemd/system/juju-db.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2020-05-14 11:45:07 BST; 1s ago
   Main PID: 23566 (mongod)
      Tasks: 3 (limit: 9374)
     Memory: 36.0M

sudo service juju-clean-shutdown status gives:

● juju-clean-shutdown.service - Stop all network interfaces on shutdown
     Loaded: loaded (/etc/systemd/system/juju-clean-shutdown.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

But there’s still no response whatsoever from juju.

So I checked the logs with journalctl -b and found repeated WiredTiger panics:

read checksum error for 4096B block at offset 65536: calculated block checksum of 1624741532 doesn't match expected checksum of 1174969535
WT_SESSION.open_cursor: the process must exit and restart: WT_PANIC: WiredTiger library panic
May 14 10:32:22 pleach.tombull.com mongod.37017[769]: [initandlisten] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 366
May 14 10:32:22 pleach.tombull.com mongod.37017[769]: [initandlisten] 
                                                    ***aborting after fassert() failure
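
To see how often mongod has hit this since the last boot, a small filter over the journal can help. This is only a rough sketch: count_wt_panics is a hypothetical helper name, and the juju-db unit name is taken from the systemctl output above. If filtering by unit returns nothing on your system, drop the -u option.

```shell
#!/usr/bin/env bash
# count_wt_panics: hypothetical helper. Reads log lines on stdin and prints
# how many of them contain a WiredTiger panic or a mongod fatal assertion.
count_wt_panics() {
    # grep -c exits non-zero when the count is 0, so mask that exit status.
    grep -cE 'WT_PANIC|Fatal Assertion' || true
}

# Usage on the controller (assumes the juju-db unit shown above):
#   sudo journalctl -b -u juju-db --no-pager | count_wt_panics
```

A count that keeps growing between runs suggests the service is stuck in a restart loop rather than recovering.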

It seems like the juju-db service was just repeatedly restarting mongod and ignoring the error.

So I ran sudo service juju-db stop and followed it with sudo mongod --dbpath /var/lib/juju/db --repair and then sudo service juju-db start.
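
For anyone repeating this, the sequence can be sketched as a small script. This is not an official procedure: the copy step is my own addition (a repair rewrites data files, so keeping a copy first seems prudent), BACKUP is a hypothetical location, and run and repair_juju_db are hypothetical helper names. Setting DRY_RUN=1 only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the stop / repair / start sequence described above.
DB_PATH="${DB_PATH:-/var/lib/juju/db}"     # path used in the post above
BACKUP="${BACKUP:-/var/lib/juju/db.bak}"   # hypothetical backup location

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# repair_juju_db: each step runs only if the previous one succeeded.
repair_juju_db() {
    run sudo service juju-db stop &&
    run sudo cp -a "$DB_PATH" "$BACKUP" &&   # assumption: enough free disk for a copy
    run sudo mongod --dbpath "$DB_PATH" --repair &&
    run sudo service juju-db start
}

# Preview first, then run for real:
#   DRY_RUN=1 repair_juju_db
#   repair_juju_db
```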

Did the juju GUI magically start working? Unfortunately not yet. But after a quick restart (sudo shutdown -r now works fine now), everything is back up and running fine.

So here are my questions:

  • Assuming this is either a hardware problem with my server or a problem with either focal or the compatibility of the juju controller with focal, how can I break down the controller and bootstrap a new controller on new hardware running bionic, maintaining the configuration and machines that I’ve already set up?
  • Are there better ways of diagnosing and fixing problems with juju controllers?

Juju 2.7 and 2.8 bootstrap a bionic controller by default today.

Are you using a manual cloud? That could explain how you got a focal controller without using --bootstrap-series.

You could bootstrap a new controller on your MAAS cloud and migrate the model that holds your current configuration to it.

Well done for pushing through @tombull.

I don’t know to what extent Juju could operate if its underlying database were to become corrupted. Running mongod --repair was a very smart move.

Generally speaking though, controller management hasn’t been well documented. Canonical has lots of experience figuring these issues out, but we haven’t been proactive at sharing this knowledge. To improve the situation, we’re currently in the process of creating a troubleshooting guide. A section on interacting with MongoDB will be a very useful addition.

Initially, I used --bootstrap-series to get a focal controller. To fix this particular problem, I ended up creating a new controller without --bootstrap-series set, and just re-creating my infrastructure from scratch.

In terms of migrating models, like in the migrating models documentation, does that end up re-creating the infrastructure, or will it take over management of the existing infrastructure on the new controller?

Thanks for the encouragement! I’m looking forward to the future development of juju - it seems like I jumped in at the deep end trying to get a custom kubernetes cluster running on MaaS / focal with a focal controller. It was just about working after several days of sustained effort, but I’ve stepped back to bionic and it seems much easier and more stable.

I agree that better documentation on troubleshooting would really help. I also mentioned on another thread that surfacing more pointers towards troubleshooting in the user interface (either CLI or GUI), such as error messages pulled from the logs, would greatly improve the user experience.

When you migrate a model, the management of the model is moved from one controller to another. The model’s deployment is not re-created.
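
As a rough sketch, the bootstrap-then-migrate flow looks something like the commands below. The names maas-cloud, new-controller, old-controller and k8s-model are placeholders rather than anything from this thread, so substitute your own.

```shell
# Placeholders (hypothetical names): maas-cloud, new-controller,
# old-controller, k8s-model.

# Bootstrap a second controller on the MAAS cloud, pinned to bionic.
juju bootstrap maas-cloud new-controller --bootstrap-series=bionic

# Hand management of the existing model over to the new controller.
# The machines and applications in the model are left running.
juju migrate old-controller:k8s-model new-controller

# Check that the model is now visible through the new controller.
juju status -m new-controller:k8s-model
```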