Juju status response time

Hi Juju team,
The 2.7.5 release is great and relieves a major pain point we have had with “juju status” on our model (currently 2465 units and 309 machines). However, FYI, it still has some issues.

Plain “juju status” is now roughly 5-8 times faster:

ubuntu@cfp001:~$ time juju status > /dev/null

real	0m4.078s
user	0m0.629s
sys	0m0.110s

This used to take at best 20 seconds and usually closer to 40.
However, if I query for something specific, the problem is still there:

ubuntu@cfp001:~$ time juju status mysql2 > /dev/null

real	0m38.221s
user	0m0.234s
sys	0m0.050s

The above results are representative of a series of a dozen test runs.
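
In case this helps others in the meantime, a possible stopgap (an untested sketch on my side) is to take the fast plain status in JSON and filter it client-side instead of asking the controller to do the filtering, e.g. with jq:

    juju status --format=json | jq '.applications["mysql2"]'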

Shall I open a bug for this?


Very interesting.

May I ask what your model contains with that many units?

Also, what is your experience and/or approach when a configuration option changes and needs to propagate to many hundreds of units? Do you manage configuration changes in the charm, or does the Juju controller take care of this efficiently?

I am venturing into a similar situation with a SLURM HPC cluster with possibly many thousands of units in a single model so any experiences are greatly appreciated.

Hi Erik,
We have built an OpenStack cluster. It currently has 220 compute hosts (some of them are also Ceph hosts) + control plane services (all in HA clusters of three units, using LXD containers) + monitoring and other supporting services. Take a look at the presentation I gave last year; it is a little outdated, especially in terms of numbers.

Regarding our experience with handling the model, I am pretty astonished! Configuration changes are propagated promptly to all units and, while the controller does all the coordination, we haven’t noticed any significant load from single config changes. Units pick up the update without lag, which is quite important, especially in cases where you have to revert a change or correct a mistake (we all make mistakes and typos, right :slight_smile: )
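
To be concrete, by a single config change I mean nothing more than the usual one-liner against an application, which Juju then propagates to every unit of that application (the application and option names below are just placeholders):

    juju config some-application some-option=new-value
    juju config some-application        # review the resulting settings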

The only case where we have seen heavy load on the controllers is when we do a mass restart of agents. However, that is expected, as all units “hit” the controller simultaneously.

Another interesting point is upgrades, which cover several different aspects of the lifecycle. In more detail (a rough command sketch follows the list):

  • Model upgrade: This is what is needed to roll out new Juju versions. Painless. Fire off the command, wait for some time (e.g. an hour) and you are ready. We have done many upgrades (over 10) and never hit an issue, either with the upgrade itself or with load or service impact.
  • Charm upgrade: Heavily dependent on the “quality” of the charm. You have to do it carefully, especially if there are major changes, but it is quite straightforward and generally safe. Again, the controllers handle these situations easily and without problems.
  • Application / OS upgrade: This is not tightly tied to Juju, but Juju can be a lifesaver. We have done 2 OpenStack upgrades so far. Juju performed very well and the charms, especially the mature ones, handled the upgrades well. We have had some issues with newer, less heavily tested charms, and with things that were deprecated or removed where cleanup was not handled sufficiently.
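
As for the command sketch mentioned above, on the 2.7 series the first two bullets boil down to roughly the following (the application name is a placeholder; double-check the flags against the docs for your Juju version):

    juju upgrade-juju --agent-version 2.7.5    # model / agent upgrade
    juju upgrade-charm some-application        # charm upgrade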

I will be happy to elaborate on our experience and this Juju/OpenStack journey :slight_smile:


Nice work! @hallback should be interested in this, as would @lasse.

I guess it’s likely the SLURM charms that could be improved in my case. It’s good to know people out there are successfully running large workloads with Juju.

Hey @soumplis, yes, please file a bug on the timing for 'juju status '. I’ll try to get to it.

There are still areas for improvement in status across the normal use cases, something I believe we can improve on as we continue with point releases.

One thing worth noting is that in the 2.7 releases, we now have agent rate limiting to reduce the controller load when the controller restarts. The default settings let about ten agents connect every second. This allows the controller to handle the initial connection and data requests from the agents as they come, rather than getting hammered all at once.
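
For anyone who wants to tune this behaviour, the limiter is, as far as I recall, exposed through controller config; roughly along these lines, though please verify the exact key names and defaults with juju controller-config on your controller:

    juju controller-config | grep ratelimit               # inspect the current limiter settings
    juju controller-config agent-ratelimit-rate=500ms     # example only: slow down the refill rate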


Thank you Tim,

The bug is 1871574 and I will track progress there.