Juju fails to complete installation of Open vSwitch bridge interfaces in OpenStack deployment

Hi!

We’ve been running an OpenStack environment for the last two and a half years with a few hiccups along the way, but mostly with little downtime. Recently we’ve been trying to add a new piece of hardware to the stack as a nova-compute node to provide more CPU cores and RAM for our VMs. Unfortunately, the installation keeps failing partway through.

We’re running Xenial/Queens, with Juju and MAAS for deployment and provisioning. We were running Xenial/Pike until December, when we upgraded. We’re starting to suspect that the upgrade to Queens is what’s causing the trouble, as we were able to add new hardware before the upgrade. We even went as far as removing one of our existing nova-compute machines and adding it back to the stack, and it now exhibits the same problems as our new hardware.

The root cause of the problems seems to lie with the neutron-openvswitch application. When we install the nova-compute charm via Juju, everything goes smoothly up until the (automatic) installation and configuration of the subordinate neutron-openvswitch charm. Watching the logs, we see that at a certain point during the install, connectivity on our OpenStack admin network (10.10.30.0/24 on eno1) is lost. We can force the installation to proceed a bit further by adding a second connection on eno2 (a different, external network), but eno1 stays unreachable and the compute service cannot communicate with the rest of the stack.

Comparing this node with our other, functional compute nodes, it looks like the admin network bridge (br-eno1) is not being created by the neutron-openvswitch charm. Some part of the process appears to take eno1 down in preparation for creating the bridge but then fails, leaving the machine unable to communicate with the rest of the stack on that interface.
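To check which bridge, if any, actually owns eno1 (rather than inferring it from ifconfig), sysfs can be queried directly. The helper below is a hypothetical sketch, not part of any charm. It assumes the standard Linux sysfs layout: /sys/class/net/<iface>/master is a symlink to the device that owns the port (a Linux bridge such as br-eno1, or ovs-system when the port belongs to the Open vSwitch kernel datapath), and every Linux bridge has a bridge/ subdirectory.

```python
import os

SYSFS_NET = "/sys/class/net"


def master_of(iface, sysfs_root=SYSFS_NET):
    """Return the name of the device that `iface` is enslaved to, or None.

    /sys/class/net/<iface>/master is a symlink to the owning device: a Linux
    bridge such as br-eno1, or 'ovs-system' for an Open vSwitch port.
    """
    link = os.path.join(sysfs_root, iface, "master")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


def is_linux_bridge(dev, sysfs_root=SYSFS_NET):
    """Linux bridges expose a 'bridge' subdirectory in sysfs."""
    return os.path.isdir(os.path.join(sysfs_root, dev, "bridge"))


def describe(iface, sysfs_root=SYSFS_NET):
    """Summarise in one line which device, if any, owns `iface`."""
    master = master_of(iface, sysfs_root)
    if master is None:
        return "{} is not attached to any bridge".format(iface)
    if master == "ovs-system":
        return "{} is an Open vSwitch port".format(iface)
    if is_linux_bridge(master, sysfs_root):
        return "{} is a port of Linux bridge {}".format(iface, master)
    return "{} is enslaved to {}".format(iface, master)


if __name__ == "__main__":
    print(describe("eno1"))
```

Run on a working node, it would be expected to report eno1 as a port of br-eno1; a node where eno1 shows up as an Open vSwitch port is one where OVS has taken over the admin interface.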

None of our configuration has changed since the upgrade to Queens, but perhaps there is some deprecation or change to the default configuration that came along with the Pike -> Queens upgrade we are unaware of? We’ve read through the release notes but can’t seem to find anything that would explain this behavior.

Any help would be greatly appreciated. I’m including a few segments of log files I think are relevant below but can provide anything else that might be needed. Thanks in advance!

Broken server ifconfig:

eno1      Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          inet addr:10.10.30.101  Bcast:10.10.30.255  Mask:255.255.255.0
          inet6 addr: fe80::4ed9:8fff:fec5:2e3/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:487314 errors:0 dropped:0 overruns:0 frame:0
          TX packets:91955 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:255807482 (255.8 MB)  TX bytes:6693026 (6.6 MB)
          Interrupt:17

eno2      Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          inet addr:10.189.134.103  Bcast:10.189.134.255  Mask:255.255.255.0
          inet6 addr: fe80::4ed9:8fff:fec5:2e4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:195386 errors:0 dropped:0 overruns:0 frame:0
          TX packets:89021 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:29175518 (29.1 MB)  TX bytes:37673375 (37.6 MB)
          Interrupt:18

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:181496 errors:0 dropped:0 overruns:0 frame:0
          TX packets:181496 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:22574807 (22.5 MB)  TX bytes:22574807 (22.5 MB)

lxdbr0    Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          inet6 addr: fe80::1/64 Scope:Link
          inet6 addr: fe80::b8c2:36ff:fe60:de08/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:650 (650.0 B)

Broken server ovs-vsctl show:

fc878983-8ae5-479f-999f-d809f5a2ba8f
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-data
        Port "eno1"
            Interface "eno1"
        Port br-data
            Interface br-data
                type: internal
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
    ovs_version: "2.9.5"

Working server ifconfig:

br-eno1   Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          inet addr:10.10.30.117  Bcast:10.10.30.255  Mask:255.255.255.0
          inet6 addr: fe80::1a66:daff:fe55:6bdc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9552045918 errors:0 dropped:4 overruns:0 frame:0
          TX packets:8731602524 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:25169343655058 (25.1 TB)  TX bytes:20302362419370 (20.3 TB)

eno1      Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          inet6 addr: fe80::1a66:daff:fe55:6bdc/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:27433132917 errors:0 dropped:821138 overruns:0 frame:0
          TX packets:25763792601 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:31217303277897 (31.2 TB)  TX bytes:26547305328673 (26.5 TB)
          Interrupt:18

eno2      Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          inet addr:10.189.134.118  Bcast:10.189.134.255  Mask:255.255.255.0
          inet6 addr: fe80::1a66:daff:fe55:6bdd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:23432963 errors:0 dropped:0 overruns:0 frame:0
          TX packets:34 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2858920977 (2.8 GB)  TX bytes:2404 (2.4 KB)
          Interrupt:19

eno3      Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:19

eno4      Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:16

gre_sys   Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          inet6 addr: fe80::d061:36ff:fecd:3bdf/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65000  Metric:1
          RX packets:1247735590 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1053172217 errors:0 dropped:8 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:934609315304 (934.6 GB)  TX bytes:1138575443474 (1.1 TB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:874404497 errors:0 dropped:0 overruns:0 frame:0
          TX packets:874404497 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:1422560696594 (1.4 TB)  TX bytes:1422560696594 (1.4 TB)

lxdbr0    Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          inet addr:10.0.216.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::d83b:4eff:fedb:7be0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:750 (750.0 B)

qbr267cccc8-45 Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          UP BROADCAST RUNNING MULTICAST  MTU:1458  Metric:1
          RX packets:257167 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:8981790 (8.9 MB)  TX bytes:0 (0.0 B)
.
.
.
.
tap267cccc8-45 Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          inet6 addr: fe80::fc16:3eff:fede:d180/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1458  Metric:1
          RX packets:4801309 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6300403 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:12100707022 (12.1 GB)  TX bytes:3222243030 (3.2 GB)
.
.
.
.
vethWY9OQC Link encap:Ethernet  HWaddr FF:FF:FF:FF:FF:FF (redacted)
          inet6 addr: fe80::fc50:b6ff:fe7a:2584/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:533168318 errors:0 dropped:0 overruns:0 frame:0
          TX packets:468982413 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:191221371188 (191.2 GB)  TX bytes:227602758832 (227.6 GB)

Working server ovs-vsctl show:

be5c20fd-46ef-4991-8dc3-3860944308e5
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-data
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "eno1"
            Interface "eno1"
                error: "could not add network device eno1 to ofproto (Device or resource busy)"
        Port "eno2"
            Interface "eno2"
        Port br-data
            Interface br-data
                type: internal
        Port phy-br-data
            Interface phy-br-data
                type: patch
                options: {peer=int-br-data}
    Bridge br-tun
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "gre-0a0a1e7f"
            Interface "gre-0a0a1e7f"
                type: gre
                options: {df_default="true", in_key=flow, local_ip="10.10.30.117", out_key=flow, remote_ip="10.10.30.127"}
        Port "gre-0a0a1e74"
            Interface "gre-0a0a1e74"
                type: gre
                options: {df_default="true", in_key=flow, local_ip="10.10.30.117", out_key=flow, remote_ip="10.10.30.116"}
        Port "gre-0a0a1e76"
            Interface "gre-0a0a1e76"
                type: gre
                options: {df_default="true", in_key=flow, local_ip="10.10.30.117", out_key=flow, remote_ip="10.10.30.118"}
        Port br-tun
            Interface br-tun
                type: internal
    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "qvo5560dd35-7e"
            tag: 2
            Interface "qvo5560dd35-7e"
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "qvo97c660e7-e3"
            tag: 1
            Interface "qvo97c660e7-e3"
        Port "qvo44aeabe3-de"
            tag: 1
            Interface "qvo44aeabe3-de"
        Port "qvo267cccc8-45"
            tag: 1
            Interface "qvo267cccc8-45"
        Port "qvofdf0ce36-50"
            tag: 2
            Interface "qvofdf0ce36-50"
        Port "qvof193baf6-c0"
            tag: 1
            Interface "qvof193baf6-c0"
        Port "qvod9facd45-41"
            tag: 1
            Interface "qvod9facd45-41"
        Port "qvoeeab657c-df"
            tag: 1
            Interface "qvoeeab657c-df"
        Port "qvodd4b9252-e5"
            tag: 1
            Interface "qvodd4b9252-e5"
        Port br-int
            Interface br-int
                type: internal
        Port "qvoc841a7f1-25"
            tag: 2
            Interface "qvoc841a7f1-25"
        Port "qvod6b38e4c-a1"
            tag: 2
            Interface "qvod6b38e4c-a1"
        Port int-br-data
            Interface int-br-data
                type: patch
                options: {peer=phy-br-data}
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
    ovs_version: "2.9.2"
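Comparing the two dumps, the most telling line is the error on eno1 under br-data on the working server: Open vSwitch tried to add eno1 there too, but the kernel refused with "Device or resource busy", presumably because eno1 is already a port of the Linux bridge br-eno1 (a network device can have only one master). On the broken server the same port is listed without an error, so OVS actually took eno1, together with the traffic for its admin address. The hypothetical parser below makes such errors easy to spot when comparing nodes; it relies only on the `Bridge`, `Interface` and `error:` lines visible above, not on any documented output format.

```python
def ports_by_bridge(show_output):
    """Map bridge name -> {interface name: error message or None}.

    Relies only on the 'Bridge', 'Interface' and 'error:' lines that appear
    in plain `ovs-vsctl show` output; every other line is ignored.
    """
    bridges = {}
    bridge = iface = None
    for raw in show_output.splitlines():
        line = raw.strip()
        if line.startswith("Bridge "):
            bridge = line.split(None, 1)[1].strip('"')
            bridges[bridge] = {}
            iface = None
        elif line.startswith("Interface ") and bridge is not None:
            iface = line.split(None, 1)[1].strip('"')
            bridges[bridge][iface] = None
        elif line.startswith("error:") and iface is not None:
            # 'error: "..."' -> the message without the surrounding quotes
            bridges[bridge][iface] = line.split(":", 1)[1].strip().strip('"')
    return bridges


def failed_interfaces(show_output):
    """List (bridge, interface, error) for every interface OVS could not add."""
    return [
        (bridge, iface, err)
        for bridge, ifaces in sorted(ports_by_bridge(show_output).items())
        for iface, err in sorted(ifaces.items())
        if err is not None
    ]
```

On a node, the input can come from `subprocess.check_output(["ovs-vsctl", "show"]).decode()`. For the two dumps above, the working server would yield one entry (eno1 in br-data, "Device or resource busy") and the broken server none.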

Broken server /var/log/juju/unit-neutron-openvswitch.log
These are the final lines before the machine loses connectivity on the admin network (eno1).

2020-05-26 18:08:02 DEBUG config-changed net.netfilter.nf_conntrack_max = 1000000
2020-05-26 18:08:02 DEBUG config-changed net.ipv4.neigh.default.gc_thresh2 = 28672
2020-05-26 18:08:02 DEBUG config-changed net.ipv6.neigh.default.gc_thresh1 = 128
2020-05-26 18:08:02 DEBUG config-changed net.nf_conntrack_max = 1000000
2020-05-26 18:08:02 DEBUG config-changed sysctl: setting key "net.netfilter.nf_conntrack_buckets"
2020-05-26 18:08:02 DEBUG config-changed net.ipv4.neigh.default.gc_thresh3 = 32768
2020-05-26 18:08:02 DEBUG config-changed net.ipv4.neigh.default.gc_thresh1 = 128
2020-05-26 18:08:02 DEBUG config-changed net.ipv6.neigh.default.gc_thresh2 = 28672
2020-05-26 18:08:02 DEBUG config-changed net.ipv6.neigh.default.gc_thresh3 = 32768
2020-05-26 18:08:02 DEBUG config-changed active
2020-05-26 18:08:03 INFO juju-log Creating bridge br-int
2020-05-26 18:08:03 INFO juju-log Creating bridge br-ex
2020-05-26 18:08:03 WARNING juju-log Support for use of upstream ``apt_pkg`` module in conjunctionwith charm-helpers is deprecated since 2019-06-25
2020-05-26 18:08:03 INFO juju-log Creating bridge br-data
2020-05-26 18:08:03 DEBUG juju-log Interface eno1 is not a Linux bridge
2020-05-26 18:08:03 INFO juju-log Adding port eno1 to bridge br-data
2020-05-26 18:08:03 DEBUG config-changed Failed to restart os-charm-phy-nic-mtu.service: Unit os-charm-phy-nic-mtu.service not found.
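The last juju-log lines show the decision the charm made: eno1 is not a Linux bridge, so it was added to br-data directly as an OVS port. The sketch below re-creates that decision for illustration only; it is not the charm's code. It assumes the bridge-to-NIC mapping comes from the charm's `data-port` option in space-separated `bridge:port` form (e.g. `br-data:eno1`), that a Linux bridge would instead be linked into OVS through a veth pair, and the names `plan_data_ports`, `linux_bridges` and `addressed` are hypothetical. The extra check it adds is the failure seen here: a NIC that still carries an IP address stops answering on that address once OVS owns it.

```python
VETH = "link Linux bridge via veth pair"
ADD_PORT = "add-port"
ADD_PORT_UNSAFE = "add-port (port holds an IP address that will stop working)"


def plan_data_ports(data_port, linux_bridges, addressed):
    """Hypothetical re-creation of the port decision logged by the charm.

    data_port     -- space-separated 'bridge:port' pairs, e.g. "br-data:eno1"
                     (assumed to mirror the charm's data-port option)
    linux_bridges -- names of interfaces that are Linux bridges
    addressed     -- names of interfaces that currently hold an IP address
    """
    plan = []
    for pair in data_port.split():
        bridge, port = pair.split(":", 1)
        if port in linux_bridges:
            action = VETH
        elif port in addressed:
            # Once OVS owns the NIC, frames arriving on it are handed to the
            # OVS datapath instead of the kernel IP stack, so the address dies.
            action = ADD_PORT_UNSAFE
        else:
            action = ADD_PORT
        plan.append((bridge, port, action))
    return plan
```

On the broken node eno1 holds 10.10.30.101, so `plan_data_ports("br-data:eno1", set(), {"eno1"})` lands in the unsafe branch; on the working nodes the address lives on br-eno1, and the add-port attempt fails harmlessly because eno1 is already busy.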

Then, we see the following (only accessible on site or by coming in through the eno2 connection):

2020-05-26 18:08:53 ERROR juju.api monitor.go:59 health ping timed out after 30s
2020-05-26 18:08:53 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2020-05-26 18:08:53 INFO juju-log Loaded template from templates/queens/openvswitch_agent.ini
2020-05-26 18:08:53 INFO juju-log Rendering from template: /etc/neutron/plugins/ml2/openvswitch_agent.ini
2020-05-26 18:08:53 INFO juju-log Wrote template /etc/neutron/plugins/ml2/openvswitch_agent.ini.
2020-05-26 18:08:54 DEBUG juju-log Generating template context for amqp
2020-05-26 18:08:54 DEBUG config-changed Traceback (most recent call last):
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/config-changed", line 266, in <module>
2020-05-26 18:08:54 DEBUG config-changed     main()
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/config-changed", line 259, in main
2020-05-26 18:08:54 DEBUG config-changed     hooks.execute(sys.argv)
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/core/hookenv.py", line 914, in execute
2020-05-26 18:08:54 DEBUG config-changed     self._hooks[hook_name]()
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1568, in wrapped_f
2020-05-26 18:08:54 DEBUG config-changed     stopstart, restart_functions)
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/core/host.py", line 741, in restart_on_change_helper
2020-05-26 18:08:54 DEBUG config-changed     r = lambda_f()
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1567, in <lambda>
2020-05-26 18:08:54 DEBUG config-changed     (lambda: f(*args, **kwargs)), __restart_map_cache['cache'],
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/config-changed", line 150, in config_changed
2020-05-26 18:08:54 DEBUG config-changed     CONFIGS.write_all()
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/templating.py", line 334, in write_all
2020-05-26 18:08:54 DEBUG config-changed     [self.write(k) for k in six.iterkeys(self.templates)]
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/templating.py", line 334, in <listcomp>
2020-05-26 18:08:54 DEBUG config-changed     [self.write(k) for k in six.iterkeys(self.templates)]
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/templating.py", line 321, in write
2020-05-26 18:08:54 DEBUG config-changed     _out = self.render(config_file)
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/templating.py", line 281, in render
2020-05-26 18:08:54 DEBUG config-changed     ctxt = ostmpl.context()
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/openstack/templating.py", line 112, in context
2020-05-26 18:08:54 DEBUG config-changed     _ctxt = context()
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/neutron_ovs_context.py", line 633, in __call__
2020-05-26 18:08:54 DEBUG config-changed     host_ip = get_relation_ip('neutron-plugin')
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/contrib/network/ip.py", line 583, in get_relation_ip
2020-05-26 18:08:54 DEBUG config-changed     address = network_get_primary_address(interface)
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/core/hookenv.py", line 1043, in inner_translate_exc2
2020-05-26 18:08:54 DEBUG config-changed     return f(*args, **kwargs)
2020-05-26 18:08:54 DEBUG config-changed   File "/var/lib/juju/agents/unit-neutron-openvswitch-43/charm/hooks/charmhelpers/core/hookenv.py", line 1239, in network_get_primary_address
2020-05-26 18:08:54 DEBUG config-changed     stderr=subprocess.STDOUT).decode('UTF-8').strip()
2020-05-26 18:08:54 DEBUG config-changed   File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
2020-05-26 18:08:54 DEBUG config-changed     **kwargs).stdout
2020-05-26 18:08:54 DEBUG config-changed   File "/usr/lib/python3.5/subprocess.py", line 708, in run
2020-05-26 18:08:54 DEBUG config-changed     output=stdout, stderr=stderr)
2020-05-26 18:08:54 DEBUG config-changed subprocess.CalledProcessError: Command '['network-get', '--primary-address', 'neutron-plugin']' returned non-zero exit status 1
2020-05-26 18:08:54 ERROR juju.worker.uniter.operation runhook.go:113 hook "config-changed" failed: exit status 1
2020-05-26 18:09:13 INFO juju-log Registered config file: /etc/neutron/neutron.conf
2020-05-26 18:09:13 INFO juju-log Registered config file: /etc/neutron/plugins/ml2/openvswitch_agent.ini

SOLVED!

It turns out that, after the upgrade to Queens, Juju was handing out a bad network configuration to this server. In addition, the Open vSwitch install was adding eno1 to br-data, whereas on my other servers eno1 sits behind a br-eno1 bridge. The steps to resolve the problem were:

  • Remove eno1 from the br-data bridge: ovs-vsctl del-port br-data eno1
  • Copy the functional config from another working server into this server's /etc/network/interfaces file, and comment out the line that pulls in the (busted) cloud-init config file, /etc/network/interfaces.d/50-cloud-init.cfg
  • Update the IPs in the new interfaces file to those found in ifconfig for the eno1 and eno2 interfaces
  • Reboot
  • Profit
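The interfaces edit in these steps amounts to moving the admin address off the NIC and onto a Linux bridge whose only port is that NIC. The sketch below renders those two ifupdown stanzas so the same change can be repeated on another node with its own addresses. `bridge_stanzas` is a hypothetical helper, and its output assumes classic ifupdown on Xenial with the bridge-utils package installed, since that package provides the `bridge_ports` handling.

```python
def bridge_stanzas(nic, address, gateway=None, dns=(), mtu=1500, bridge=None):
    """Render ifupdown stanzas that move `address` from `nic` to a bridge.

    The NIC itself is left unnumbered ('inet manual'); the address, gateway
    and resolvers live on the bridge, whose only port is the NIC.
    """
    bridge = bridge or "br-" + nic
    lines = [
        "auto {}".format(nic),
        "iface {} inet manual".format(nic),
        "    mtu {}".format(mtu),
        "",
        "auto {}".format(bridge),
        "iface {} inet static".format(bridge),
        "    address {}".format(address),
    ]
    if dns:
        lines.append("    dns-nameservers {}".format(" ".join(dns)))
    if gateway:
        lines.append("    gateway {}".format(gateway))
    lines.append("    bridge_ports {}".format(nic))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bridge_stanzas("eno1", "10.10.30.101/24", gateway="10.10.30.254",
                         dns=["10.10.30.99", "10.244.0.66", "10.244.0.67"]))
```

With the addresses from this thread it produces the eno1 and br-eno1 stanzas of the final file; the NIC stays `inet manual` so that only the bridge holds the IP.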

I don’t yet know exactly what caused Juju to stop sending a proper network config after the upgrade.

My final interfaces file looked like this. Anyone copying it will, of course, need to substitute their own IP addresses.

auto lo
iface lo inet loopback
    dns-nameservers 10.10.30.99 10.244.0.66 10.244.0.67
    dns-search maas

auto eno1
iface eno1 inet manual
    mtu 1500

auto eno2
iface eno2 inet static
    address 10.189.134.103/24
    dns-nameservers 10.189.134.99 10.244.0.66 10.244.0.67
    mtu 1500

auto br-eno1
iface br-eno1 inet static
    address 10.10.30.101/24
    dns-nameservers 10.10.30.99 10.244.0.66 10.244.0.67
    gateway 10.10.30.254
    bridge_ports eno1
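Because a mistake in this file takes the node off the admin network on reboot, a quick sanity check before rebooting can save a trip on site. The sketch below is a hypothetical checker, not an ifupdown tool: it parses interfaces(5)-style text loosely, does not follow `source` lines, reports repeated `iface` stanzas, and flags a NIC listed in `bridge_ports` that still has its own `static` or `dhcp` stanza instead of `inet manual`.

```python
TOP_LEVEL = ("auto", "source", "source-directory", "mapping", "rename")


def check_interfaces(text):
    """Return a list of warnings for /etc/network/interfaces-style text."""
    stanzas = {}   # (name, family) -> {"method": str, "options": {key: [values]}}
    problems = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        words = line.split()
        if words[0] == "iface" and len(words) >= 4:
            key = (words[1], words[2])
            if key in stanzas:
                # Often a copy-and-paste leftover; worth tidying up.
                problems.append("repeated stanza: iface {} {}".format(*key))
            current = stanzas.setdefault(key, {"method": words[3], "options": {}})
        elif words[0] in TOP_LEVEL or words[0].startswith("allow-"):
            current = None  # a new top-level line ends the current stanza
        elif current is not None:
            current["options"][words[0]] = words[1:]
    # A NIC enslaved to a bridge should not carry its own address.
    for (name, family), stanza in sorted(stanzas.items()):
        for port in stanza["options"].get("bridge_ports", []):
            port_stanza = stanzas.get((port, family))
            if port_stanza and port_stanza["method"] != "manual":
                problems.append(
                    "{} is a port of {} but uses method '{}' (expected 'manual')"
                    .format(port, name, port_stanza["method"]))
    return problems
```

Passing in the text of /etc/network/interfaces and getting back an empty list means neither problem was found; it says nothing about whether the addresses themselves are right.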

I found the following sites helpful when troubleshooting:


This might be worth filing a bug for, so we can investigate properly.
