How to remove a failed mysql-innodb-cluster instance?

Having some trouble with a new (PoC, MAAS + Juju) OpenStack installation using the mysql-innodb-cluster charm.

There are three database servers with fixed IPv4 addresses. The hardware RAID on one of them failed, causing the OS filesystem to be remounted read-only. Juju elected a new leader, as the failed server was the leader at the time. Its Juju agent was lost and never recovered; the two other servers lived on. The unit and machine were forcibly removed from Juju, after which the charm went into a blocked state. Of course, a cluster has to have 3 members.

The RAID card was replaced, the disks were wiped, the server was back in the 'Ready' state in MAAS, and, being naïve, I thought I could do an 'add-unit' and the server would be added to the cluster, overriding the old server at that address.

The add-unit did deploy, but now the charm is in an error state because the new unit is not in a cluster. Running the 'cluster-status' action on one of the other units reveals that the address is still in use. OK, so the next step, I thought, would be to remove the instance with that address from the cluster.
Running the action 'juju run-action --wait mysql/1 remove-instance --string-args address=172.30.50.10' returns the following:

UnitId: mysql/1
id: "86"
message: Remove instance failed
results:
  output: |+
    Logger: Tried to log to an uninitialized logger.
    Traceback (most recent call last):
      File "<string>", line 3, in <module>
    SystemError: TypeError: Cluster.remove_instance: Option 'force' is expected to be of type Bool, but is Null

  traceback: |
    Traceback (most recent call last):
      File "/var/lib/juju/agents/unit-mysql-1/charm/actions/remove-instance", line 299, in remove_instance
        output = instance.remove_instance(address, force=force)
      File "/var/lib/juju/agents/unit-mysql-1/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 813, in remove_instance
        raise e
      File "/var/lib/juju/agents/unit-mysql-1/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 801, in remove_instance
        output = self.run_mysqlsh_script(_script).decode("UTF-8")
      File "/var/lib/juju/agents/unit-mysql-1/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 1436, in run_mysqlsh_script
        return subprocess.check_output(cmd, stderr=subprocess.PIPE)
      File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "/usr/lib/python3.8/subprocess.py", line 512, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['/snap/bin/mysqlsh', '--no-wizard', '--python', '-f', '/root/snap/mysql-shell/common/tmp_tidatm8.py']' returned non-zero exit status 1.
status: failed

Passing a .yaml file with --params gives the same result, even with a valid YAML boolean for force (true):

params:
  address: "172.30.50.10"
  force: true
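
For reference, this is roughly how it was invoked (the filename below is just a placeholder):

# same action, but with the parameters read from a YAML file
juju run-action --wait mysql/1 remove-instance --params remove-instance.yaml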

Then I figured I could just do it by hand, but that requires knowing how to connect to the cluster… and while the mysql charm stores the password in /var/lib/mysql/mysql.passwd, I haven't found an equivalent for this charm. The temporary file tmp_tidatm8.py is, of course, temporary, so I can't check the values used in it.
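
(After the fact, I realised the credentials can probably also be read from Juju's leader settings instead of being fished out of temporary files; a rough sketch, assuming the charm keeps a cluster-password key there, which I haven't verified:)

# list all leader settings for the application, then read the suspected key;
# the key name "cluster-password" is an assumption and may differ
juju run --unit mysql/1 'leader-get'
juju run --unit mysql/1 'leader-get cluster-password'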

I'm a bit stuck here. While my ham-fisted forced removal of the unit and machine in Juju can't have helped, I'm looking for a way to restore the cluster rather than doing an entire redeploy.

Can anyone give some pointers about what to do better next time, and what to do now?

Looping in the @openstack-charmers folks who wrote the mysql-innodb-cluster charm; they may have some advice here.

I managed to extract the credentials from the temporary file by placing a dummy Python file, /root/snap/mysql-shell/common/zzzz.py, in the directory and watching it for changes before running an action:

# entr -d exits when a new file appears in the watched directory, so loop to
# restart it; when the temp script is (re)written, cat it (/_ is entr's
# placeholder for the file that changed)
while true; do
  ls -d /root/snap/mysql-shell/common/*.py | entr -d cat /_
done

This managed to output the first 2.5 lines of the script to stdout, which contain the user and password for the cluster. After poking around in the source of the charm, I found that the remove_instance method does the following:

    _script = (
        "shell.connect('{user}:{pw}@{caddr}')\n"
        "cluster = dba.get_cluster('{name}')\n"
        "cluster.remove_instance('{user}@{addr}', {{'force': {force}}})"
        .format(
            user=self.cluster_user, pw=self.cluster_password,
            caddr=_primary or self.cluster_address,
            name=self.cluster_name, addr=address, force=force))
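
Spelled out, the manual equivalent is roughly the following; the user, password and cluster name are the values recovered above, and 172.30.50.11 is just a stand-in for one of the surviving units:

# run the same commands the action would, against a surviving member;
# CLUSTER_PASSWORD and the cluster name are placeholders for the recovered values
/snap/bin/mysqlsh --no-wizard --python <<'EOF'
shell.connect('clusteruser:CLUSTER_PASSWORD@172.30.50.11')
cluster = dba.get_cluster('jujuCluster')
# force=True because the old instance is unreachable
cluster.remove_instance('clusteruser@172.30.50.10', {'force': True})
EOF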

So I executed that in mysqlsh, only to get an assertion failure there this time:
mysqlsh: /build/mysql-shell/parts/mysql-shell/src/modules/adminapi/cluster/cluster_impl.cc:672: std::tuple<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, bool> mysqlsh::dba::Cluster_impl::get_replication_user(const mysqlshdk::mysql::IInstance&) const: Assertion `!recovery_user.empty()' failed.
Aborted (core dumped)

Reconnected to the cluster and, well, the topology hasn't changed: cluster.status() still shows the instance as MISSING. So… I guess mysqlsh is failing to find the recovery user? Ah well…

Looks like there's not really a way to remove the server from the topology.

[Edit]
After posting the above, I figured it was worth a shot to remove the address from the metadata table itself.

Like so:
shell.connect('{user}:{pw}@{caddr}')
\use mysql_innodb_cluster_metadata
\sql
DELETE FROM instances WHERE address = '172.30.50.10:3306';
DELETE FROM v2_instances WHERE address = '172.30.50.10:3306';
\use mysql
DELETE FROM user WHERE Host = '172.30.50.10';
DELETE FROM user WHERE User = 'mysql_innodb_cluster_1000';
\exit

After issuing an add-unit and letting the charm reinstall on the server, I was able to execute cluster.add_instance('clusteruser@172.30.50.10') and the cluster now reports as OK.
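
For completeness, a sketch of the re-add plus a status check, run the same way as the removal attempt above (the recoveryMethod option is only there because --no-wizard suppresses the interactive prompt; a wiped server needs a full clone anyway):

# re-add the reinstalled server and verify the topology; credentials and
# cluster name are again the values recovered earlier
/snap/bin/mysqlsh --no-wizard --python <<'EOF'
shell.connect('clusteruser:CLUSTER_PASSWORD@172.30.50.11')
cluster = dba.get_cluster('jujuCluster')
# the server was wiped, so clone recovery pulls a full copy of the data
cluster.add_instance('clusteruser@172.30.50.10', {'recoveryMethod': 'clone'})
print(cluster.status())
EOF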

I did find this in the install log from /var/log/juju/unit-mysql-11.log (I went through unit numbers 3 to 10 just trying different things):

2020-09-07 10:15:00 ERROR juju-log Cluster is unavailable: Logger: Tried to log to an uninitialized logger.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
mysqlsh.DBError: MySQL Error (1045): Shell.connect: Access denied for user 'clusteruser'@'172.30.50.10' (using password: YES)

2020-09-07 10:15:00 DEBUG jujuc server.go:211 running hook tool "juju-log"
2020-09-07 10:15:00 WARNING juju-log Cannot determine the cluster primary RW node for writes.

Not sure if manually adding an instance like this is going to have any side-effects, but there aren't any plans to add more database servers to this environment.

I think the original problem, if I understand correctly, is that the new clean server is still at the same IP address as the old forcibly removed one.

Please use the remove-instance action with the IP of the old unit to clear the cluster metadata. Then re-deploy the unit.
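
Something along these lines (force is needed because the old instance can no longer be contacted, and without --string-args the value is parsed as a boolean):

# clear the stale member from the cluster metadata; run against a healthy unit
juju run-action --wait mysql/1 remove-instance address=172.30.50.10 force=true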

Cluster status should show only the two functional servers before you proceed.

[0] https://github.com/openstack/charm-mysql-innodb-cluster/blob/master/src/actions.yaml#L45

Hey @thedac, looks like your tip can help me too. I got the unit removed and now I've got two units (one RW and one RO). I'm trying to add a new unit to get two ROs, but the deployment is stuck with "MySQL InnoDB Cluster not healthy: None" on just that unit; the other two stay RW and RO. Is there anything I'm missing in the deployment or configuration?

This is the command that I've used to deploy:

juju add-unit --to lxd:1 mysql-innodb-cluster