ulimit/LimitMEMLOCK on Focal?

I installed charmed-kubernetes on OpenStack and ran into a problem with the OCCM pods installed by the openstack-integrator charm (via the cdk-addons module). The OCCM pods are crashing on startup with the message:

runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed
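
The error itself suggests two things to look at, so for context these were the obvious first checks on a worker node (nothing Juju-specific about them):

$ uname -r     # the error says kernels 5.3.15+, 5.4.2+ or 5.5+ avoid the mlock path
$ ulimit -l    # the memlock limit (in KiB) an interactive shell gets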

I think this is caused by LimitMEMLOCK in the juju .service files.

$ juju run --application kubernetes-worker -- ulimit -l
- Stdout: |
    64
  UnitId: kubernetes-worker/1
- Stdout: |
    64
  UnitId: kubernetes-worker/2
- Stdout: |
    64
  UnitId: kubernetes-worker/0

$ juju run --unit kubernetes-worker/0 -- systemctl show jujud-unit-kubernetes-worker-0.service | grep MEMLOCK
LimitMEMLOCK=65536
LimitMEMLOCKSoft=65536

It is set this way on all the juju units in this model. It seems to be a Focal thing. In a different model (a MAAS one) I have:

$ juju run --unit syslog-server/3 -- systemctl show jujud-syslog-server-3.service | grep MEMLOCK
LimitMEMLOCK=16777216
LimitMEMLOCKSoft=16777216
$ juju run --unit vault/0 -- systemctl show jujud-unit-vault-0.service | grep MEMLOCK         
LimitMEMLOCK=65536
LimitMEMLOCKSoft=65536

In this case, syslog-server is on Bionic, vault on Focal.

Can someone help me understand this please?


The only limit Juju explicitly sets in the systemd conf files it creates is "LimitNOFILE", e.g.

$ cat /etc/systemd/system/jujud-machine-1.service 
[Unit]
Description=juju agent for machine-1
After=syslog.target
After=network.target
After=systemd-user-sessions.service

[Service]
Environment="JUJU_DEV_FEATURE_FLAGS=developer-mode,image-metadata,juju-v3"
LimitNOFILE=64000
ExecStart=/etc/systemd/system/jujud-machine-1-exec-start.sh
Restart=on-failure
TimeoutSec=300

[Install]
WantedBy=multi-user.target

So anything else comes from systemd. I did a quick test on bionic and focal on the LXD provider and got 65536 in both cases.
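
For reference, the quick test was roughly this (machine numbers are illustrative):

$ juju add-machine --series bionic    # say this becomes machine 0
$ juju add-machine --series focal     # and this one machine 1
$ juju run --machine 0 -- ulimit -l
$ juju run --machine 1 -- ulimit -l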


Many thanks, this is interesting. Are you able to explain the following please?

routergod@management:~$ juju status kubernetes-worker/0
Model  Controller   Cloud/Region         Version  SLA          Timestamp
k8s    domaintrust  openstack/RegionOne  2.8.3    unsupported  18:44:39Z

App                Version  Status  Scale  Charm              Store       Rev  OS      Notes
containerd         1.3.3    active      1  containerd         jujucharms   94  ubuntu
flannel            0.11.0   active      1  flannel            jujucharms  506  ubuntu
kubernetes-worker  1.19.0   active      1  kubernetes-worker  jujucharms  704  ubuntu  exposed

Unit                  Workload  Agent  Machine  Public address  Ports           Message
kubernetes-worker/0*  active    idle   6        10.0.27.186     80/tcp,443/tcp  Kubernetes worker running.
  containerd/1        active    idle            10.0.27.186                     Container runtime available
  flannel/1           active    idle            10.0.27.186                     Flannel subnet 10.1.30.1/24

Machine  State    DNS          Inst id                               Series  AZ    Message
6        started  10.0.27.186  d3e5f404-704b-42e0-9a8a-d2759e182817  focal   nova  ACTIVE

routergod@management:~$ juju run --machine 6 -- ulimit -l
65536
routergod@management:~$ juju run --unit kubernetes-worker/0 -- ulimit -l
64

Many thanks in advance!


I can't explain this at the moment. I tried it for a simple test deployment and got 64 in both cases.
Just to rule out something weird, you could juju run ip address or something to confirm that the same target machine is being hit in both cases.


Sure.

routergod@management:~$ juju run --machine 6 -- ip -o ad | grep ens3
2: ens3    inet 10.0.27.186/24 brd 10.0.27.255 scope global dynamic ens3\       valid_lft 75381sec preferred_lft 75381sec
2: ens3    inet6 fe80::f816:3eff:fe5a:1b0a/64 scope link \       valid_lft forever preferred_lft forever
routergod@management:~$ juju run --unit kubernetes-worker/0 -- ip -o ad | grep ens3
2: ens3    inet 10.0.27.186/24 brd 10.0.27.255 scope global dynamic ens3\       valid_lft 75355sec preferred_lft 75355sec
2: ens3    inet6 fe80::f816:3eff:fe5a:1b0a/64 scope link \       valid_lft forever preferred_lft forever

I notice the same oddity in every unit in this openstack model (etcd and all the others).


If you ssh into the machine and run ulimit directly, I wonder which of the two scenarios it matches. There's definitely something weird happening that I have no answer for.

ubuntu@juju-8cb85c-k8s-6:~$ ulimit -l
65536

But here's the rub:

routergod@management:~$ kubectl exec hello-node-7567d9fdc9-dxnf7  --tty --stdin -- /bin/bash
root@hello-node-7567d9fdc9-dxnf7:/# ulimit -l
64

My brain hurts.

I wonder if this is a case of something about the process/charm itself configuring ulimits. On LXD I definitely see 64 in both cases.
Even with ssh, I get:

ubuntu@juju-1f04ed-0:~$ ulimit -l
64
ubuntu@juju-1f04ed-0:~$ sudo -i
ulimit -l
root@juju-1f04ed-0:~# ulimit -l
64

As for LimitMEMLOCK being set, I know the systemd wrapper we have has support for changing memlock (and we set memlock to UNLIMITED for juju-db).

I don't see other locations where juju is trying to control those things for unit agents, etc.
I also wonder if those are bionic vs focal vs xenial differences.

To track down this issue, start backtracking the process tree to see where values are changed.

Use pstree to see the process tree.

pstree

Then use

cat /proc/<PID>/limits

to see whether you have an unwanted setting and, from that, discover where you lose it. The parent process will normally be setting the child's limits.
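
If it helps, a small loop like this automates that walk: starting from a PID (the current shell by default) it prints each ancestor's memlock limit from /proc. A rough sketch, nothing Juju- or snap-specific:

pid=${1:-$$}
while [ "$pid" -gt 0 ]; do
    comm=$(cat "/proc/$pid/comm")
    echo "$pid ($comm): $(grep 'Max locked memory' "/proc/$pid/limits")"
    pid=$(awk '/^PPid:/ {print $2}' "/proc/$pid/status")   # hop to the parent
done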

I wrote up similar guidance for snaps in this thread: How to customize systemd service created by snapd? - snapd - snapcraft.io


Thanks for the tip. Actually I got a good clue with the more brutal:

grep locked /proc/*/limits

@wallyworld @jameinel it seems that the only process not having a 64K memlock limit is PID 1 (systemd).

/proc/1/limits:Max locked memory         67108864             67108864             bytes

All the other processes are like these:

# a jujud
/proc/129776/limits:Max locked memory         65536                65536                bytes
# snapd
/proc/25312/limits:Max locked memory         65536                65536                bytes
# atd
/proc/862/limits:Max locked memory         65536                65536                bytes

So this is a Focal systemd thing, not a juju thing, and it seems like normal behavior. The values here line up with the LimitMEMLOCK settings for the services, although these are all inherited; there is no .service file in /usr/lib/systemd/*/ that sets LimitMEMLOCK explicitly.
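
For anyone checking their own machines, these are the sorts of places to look (DefaultLimitMEMLOCK is a standard systemd manager property):

# is any unit or drop-in setting it explicitly?
$ grep -r LimitMEMLOCK /etc/systemd/ /usr/lib/systemd/ 2>/dev/null

# what default does the systemd manager hand out to the services it starts?
$ systemctl show -p DefaultLimitMEMLOCK -p DefaultLimitMEMLOCKSoft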

When I log in interactively I get the higher limit:

/proc/self/limits:Max locked memory         67108864             67108864             bytes

In my case it appears that the 64K limit is not sufficient to allow the openstack-cloud-controller-manager pods to execute. I manually added LimitMEMLOCK=infinity to the containerd.service files and now my OCCM pods are running.
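
For anyone copying that fix: editing the shipped unit file directly risks being overwritten on package (or charm) updates, so a systemd drop-in is probably the safer way to carry the same change. A sketch of the standard mechanism (I have not checked how the containerd charm interacts with drop-ins):

$ sudo mkdir -p /etc/systemd/system/containerd.service.d
$ printf '[Service]\nLimitMEMLOCK=infinity\n' | sudo tee /etc/systemd/system/containerd.service.d/memlock.conf
$ sudo systemctl daemon-reload
$ sudo systemctl restart containerd.service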

The thing is, I don't see why the higher limit is actually required for the OCCM. And clearly the problem I am suffering is not specifically a juju one. I already reported this as a CDK Addons bug; to be honest I don't really know where it sits.

All that aside, it seems that 'run --machine' is reporting the systemd value, not the jujud-machine-x value, for some reason?

This looks like the cause of my woes.

@routergod If this is still an issue for you, we've found a workaround which I've documented in the bug lp#1898726.


snap, systemd and limits is a difficult situation… We have fought this for a long time with @jamesbeedy. I wrote up some comments on this and how to pursue it in the snapcraft discourse: How to customize systemd service created by snapd? - snapd - snapcraft.io