How to remove a failed mysql-innodb-cluster instance?

Having some trouble with a new (PoC, MAAS + Juju) OpenStack installation using the mysql-innodb-cluster charm.

There are three database servers with fixed IPv4 addresses. The hardware RAID on one of them failed, causing the OS to be mounted RO. This caused the charm to elect a new leader, as the failed server was leader at the time. The juju-agent was lost and never recovered. The two other servers lived on. The unit and machine were forcibly removed from Juju, after which the charm went into a blocked state. Of course, a cluster has to have 3 members.

The RAID card was replaced, the disks were wiped, and the server was returned to the Ready state in MAAS. Being naïve, I thought I could do an 'add-unit' and the server would be added to the cluster, replacing the old server at that address.

The add-unit did deploy, but now the charm is in an error state because the new unit is not in a cluster. Running the action 'cluster-status' on one of the other units reveals that the address is still in use. OK, so I thought the next step would be to remove the instance with that address from the cluster.
Running the action 'juju run-action --wait mysql/1 remove-instance --string-args address=172.30.50.10' returns the following:

UnitId: mysql/1
id: "86"
message: Remove instance failed
results:
  output: |+
    Logger: Tried to log to an uninitialized logger.
    Traceback (most recent call last):
      File "<string>", line 3, in <module>
    SystemError: TypeError: Cluster.remove_instance: Option 'force' is expected to be of type Bool, but is Null

  traceback: |
    Traceback (most recent call last):
      File "/var/lib/juju/agents/unit-mysql-1/charm/actions/remove-instance", line 299, in remove_instance
        output = instance.remove_instance(address, force=force)
      File "/var/lib/juju/agents/unit-mysql-1/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 813, in remove_instance
        raise e
      File "/var/lib/juju/agents/unit-mysql-1/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 801, in remove_instance
        output = self.run_mysqlsh_script(_script).decode("UTF-8")
      File "/var/lib/juju/agents/unit-mysql-1/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 1436, in run_mysqlsh_script
        return subprocess.check_output(cmd, stderr=subprocess.PIPE)
      File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "/usr/lib/python3.8/subprocess.py", line 512, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['/snap/bin/mysqlsh', '--no-wizard', '--python', '-f', '/root/snap/mysql-shell/common/tmp_tidatm8.py']' returned non-zero exit status 1.
status: failed

Passing a .yaml file via --params, with a valid YAML boolean for force (true), gives the same result:
params:
address: "172.30.50.10"
force: true

Then I figured I could just do it by hand, but that requires knowing how to connect to the cluster. Whereas the mysql charm stores the password in /var/lib/mysql/mysql.passwd, I haven’t found an equivalent for this charm. The temporary file tmp_tidatm8.py is, of course, temporary, so I can’t check the values used in it.

I’m a bit stuck here. While my ham-fisted forced removal of the unit and machine in Juju can’t have helped, I’m looking for a way to restore the cluster rather than do an entire redeploy.

Can anyone give some pointers about what to do better next time, and what to do now?

Looping in the @openstack-charmers folks who wrote the mysql-innodb-cluster charm who may have some advice here.

I managed to extract the credentials from the temporary file by placing a dummy Python file at /root/snap/mysql-shell/common/zzzz.py and watching the directory for changes before running an action:

while true; do
  ls -d /root/snap/mysql-shell/common/*.py | entr -d cat /_
done
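For anyone without entr installed, the same idea can be sketched in Python by polling the directory. This is only a rough illustration: the function name capture_new_scripts and its parameters are made up here, and because it polls rather than reacting to filesystem events, a very short-lived temporary file can still be missed or read only partially.

```python
import time
from pathlib import Path


def capture_new_scripts(directory, timeout=30.0, interval=0.01):
    """Poll `directory` and return {filename: text} for new *.py files.

    Hypothetical helper, not part of the charm. Files that disappear
    between listing and reading are skipped, and empty reads are ignored.
    """
    directory = Path(directory)
    known = set(directory.glob("*.py"))  # files present before the action runs
    captured = {}
    deadline = time.monotonic() + timeout
    # Stop as soon as something was captured or the timeout expires.
    while time.monotonic() < deadline and not captured:
        for path in set(directory.glob("*.py")) - known:
            try:
                text = path.read_text()
            except OSError:
                continue  # deleted before we could read it
            if text:
                captured[path.name] = text
        time.sleep(interval)
    return captured
```

Start it as root against /root/snap/mysql-shell/common in one terminal, then trigger the action from another.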

The entr loop managed to output the first 2.5 lines of the script to stdout, which contain the user and password for the cluster. After poking around in the source of the charm, I found that the remove_instance method does the following:

    _script = (
        "shell.connect('{user}:{pw}@{caddr}')\n"
        "cluster = dba.get_cluster('{name}')\n"
        "cluster.remove_instance('{user}@{addr}', {{'force': {force}}})"
        .format(
            user=self.cluster_user, pw=self.cluster_password,
            caddr=_primary or self.cluster_address,
            name=self.cluster_name, addr=address, force=force))
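
The TypeError in the action output ("expected to be of type Bool, but is Null") suggests that force reached this template as Python None, which str.format renders as the literal None. That is an inference from the error message, not something confirmed in the charm source. A minimal sketch of the same template with the value coerced first (build_remove_script is a hypothetical stand-alone function, not the charm's method):

```python
def build_remove_script(user, pw, caddr, name, addr, force=None):
    """Render the mysqlsh script with `force` always a Python bool.

    Hypothetical stand-alone version of the template quoted above.
    Only None and real booleans are handled; a string such as "false"
    would still be truthy and needs separate parsing.
    """
    force = bool(force)  # None becomes False, so the script never says None
    return (
        "shell.connect('{user}:{pw}@{caddr}')\n"
        "cluster = dba.get_cluster('{name}')\n"
        "cluster.remove_instance('{user}@{addr}', {{'force': {force}}})"
        .format(user=user, pw=pw, caddr=caddr,
                name=name, addr=addr, force=force))
```

Without the coercion, a force of None would produce {'force': None} in the generated script, which matches the error mysqlsh reported.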

So I executed that in mysqlsh, to get an exception there this time:
mysqlsh: /build/mysql-shell/parts/mysql-shell/src/modules/adminapi/cluster/cluster_impl.cc:672: std::tuple<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > >, bool> mysqlsh::dba::Cluster_impl::get_replication_user(const mysqlshdk::mysql::IInstance&) const: Assertion `!recovery_user.empty()' failed.
Aborted (core dumped)

I reconnected to the cluster and, well, the topology hasn’t changed: cluster.status() still shows the instance as MISSING. So… I guess mysqlsh is failing to find the recovery user? Ah well…

Looks like there’s not really a way to remove the server from the topology.

[Edit]
After posting the above, I figured it was worth a shot to remove the address from the metadata table itself.

Like so:
shell.connect('{user}:{pw}@{caddr}')
\use mysql_innodb_cluster_metadata
\sql
DELETE FROM instances WHERE address = '172.30.50.10:3306';
DELETE FROM v2_instances WHERE address = '172.30.50.10:3306';
\use mysql
DELETE FROM user WHERE Host = '172.30.50.10';
DELETE FROM user WHERE User = 'mysql_innodb_cluster_1000';
\exit

After issuing an add-unit and letting the charm reinstall on the server, I was able to execute cluster.add_instance('clusteruser@172.30.50.10') and the cluster reports as OK now.

I did find this in the install log from /var/log/juju/unit-mysql-11.log (I went through unit numbers 3 to 10 just trying different things):

2020-09-07 10:15:00 ERROR juju-log Cluster is unavailable: Logger: Tried to log to an uninitialized logger.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
mysqlsh.DBError: MySQL Error (1045): Shell.connect: Access denied for user 'clusteruser'@'172.30.50.10' (using password: YES)

2020-09-07 10:15:00 DEBUG jujuc server.go:211 running hook tool "juju-log"
2020-09-07 10:15:00 WARNING juju-log Cannot determine the cluster primary RW node for writes.

Not sure if manually adding an instance like this is going to have any side-effects, but there aren’t any plans to add more database servers to this environment.

I think the original problem, if I understand correctly, is that the new clean server is still at the same IP address as the old forcibly removed one.

Please use the remove-instance action [0] with the IP of the old unit to clear the cluster metadata. Then re-deploy the unit.

Cluster status should show only the two functional servers before you proceed.
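
As a rough way to check that condition, the sketch below scans a cluster status document for members that are not ONLINE. It assumes the status has been turned into a plain Python dict (for example by parsing the JSON) and has a defaultReplicaSet/topology layout as in the sample. The helper name and the sample data are made up for illustration; only 172.30.50.10 comes from this thread.

```python
def unhealthy_instances(status):
    """Return {instance: status} for every member that is not ONLINE."""
    topology = status["defaultReplicaSet"]["topology"]
    return {
        name: member.get("status", "UNKNOWN")
        for name, member in topology.items()
        if member.get("status") != "ONLINE"
    }


# Hypothetical sample: the .11 and .12 addresses are invented.
sample_status = {
    "defaultReplicaSet": {
        "topology": {
            "172.30.50.10:3306": {"status": "(MISSING)"},
            "172.30.50.11:3306": {"status": "ONLINE"},
            "172.30.50.12:3306": {"status": "ONLINE"},
        }
    }
}

# The stale member must be gone from this result before re-adding a unit.
print(unhealthy_instances(sample_status))
```

An empty result, with only the two working servers left in the topology, would be the state to aim for before running add-unit again.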

[0] https://github.com/openstack/charm-mysql-innodb-cluster/blob/master/src/actions.yaml#L45