When to send an application into an error state

Warning - opinion ahead!

The error state is useful for charm authors, but not helpful in production. If I’m someone who deploys charms (but doesn’t write them), I don’t have the expertise to fix the problem.

Here’s a relatively common pattern for resolving charm errors in a live model:

  • run juju debug-log --include <unit-id> (but how did I learn about this command? It isn’t mentioned in any of the charm’s documentation)
  • run juju debug-hook, which drops me into a tmux session (what if I’ve never used tmux?)
  • edit the charm (but then I need to know where to find its source code. How do people learn to look in /var/lib/juju/...?)
  • cross fingers
  • then run juju resolve

It can be simpler to remove an application and re-deploy it, but that can impose significant downtime in production.

The error state is a very powerful hammer. It suspends the charm and prevents a poorly coded charm from doing further damage. But that doesn’t really help people who are running live workloads. @zicklag has a very relevant anecdote:

Juju charms, for safety purposes, have rather conservative behavior when any charm hook fails. The charm will go into an error state and essentially freeze all operations (other than retrying the previous operation) to prevent data loss. The issue with this is that it tends to result in cascading, irrecoverable error states when faced with a bug in the charm.

For example, I ran into a situation where I had an HTTP proxy charm with a bug that caused a hook failure under certain conditions. I related this HTTP proxy charm (not knowing about the bug) to a Grafana charm. When the proxy charm went into an error state, the only way to fix it was to force remove the charm. I force removed the proxy charm, but then Grafana’s hook errored out, not because of a bug in Grafana, but because the charm’s Juju agent was eternally trying to respond to a relation hook event on a relation that no longer existed after the proxy’s force removal.

My only recourse was to force remove Grafana. If that Grafana charm had been related to a database, the database charm probably would have errored and had to be force removed as well, and so on.

That only happened in a dev environment, but it is a very scary thing to put into production. One unforeseen situation in charm code could necessitate the removal of my entire production stack. And that’s not all.

Is this something that’s fixable? What are your suggestions for improving the situation?


Good topic! 🙂

I think that maybe the biggest problem in my case was just the fact that the error state of one charm could force an error state in another charm.

I think that charms should (maybe) almost never intentionally go into an error state unless something is horribly wrong, which is what the charming documentation already says. The issue is with charm bugs or environmental problems (such as lack of disk space), which are obviously not intentional.

When one charm errors and you have to force remove and replace it, that shouldn’t cause another, perfectly working charm to error just because the Juju agent can’t find the relation that was force removed.

You nailed the process and the hurdles perfectly here, @timClicks.

The tmux session is a beautiful example of how you lose 80% of devops people and enter guru-land.

I spent many, many hours discovering these tricks by myself. Perhaps that’s just me being a noob, but I’m definitely not alone. However, I’m also afraid that this dark magic of debugging Juju charms keeps us noobs from being able to charm.

We also want to learn how to throw fireballs.

I’ve thought about this a lot over the years, and from my perspective the problem needs to be addressed at two very different levels.

Firstly, we can make charms error less often by writing them in more robust, foolproof ways, probably encouraged by the nascent Operator Framework. There are a lot of unusual situations that one has to handle oneself in reactive or pre-reactive charms, particularly around the precise mechanics of relation events. Whether they be Juju hook tool error handling problems or interface data workflow edge cases, their rarity and variety mean an individual charm author is unlikely to get things right until someone runs into them with that particular charm, probably during a production outage.

I think that a combination of things we already know we want to focus on will help:

  • An Operator Framework that provides a simpler, Pythonic interface to the world, that’s harder to get wrong as the author of a pretty standard charm
  • Encouragement to use idempotent charm designs that depend less on specific events and more on the current state of the model (e.g. relation-changed examines all relations and writes them out, rather than just appending the new one), minimising edge cases and reducing the complexity of mutable global state; see the sketch after this list
  • Easy and solid charm test infrastructure that makes checking edge cases easy
  • Auxiliary libraries on top of the Operator Framework that implement common relations (e.g. website, juju-info and pgsql) in a reusable way that handles the sharper bits
  • Components in the Operator Framework to implement common relation patterns (e.g. the database request conversation that underlies relations like pgsql and mysql), making implementing robust new ones simple
  • A carefully chosen set of high-quality, well-engineered charms, proven in production, that set a good example for newer charm authors
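
To make the idempotent, state-driven style concrete, here is a minimal sketch using the Operator Framework. The charm class, the website relation, and the _write_config helper are all hypothetical; the point is that the handler rebuilds its configuration from the model’s current state rather than from the single event that woke it:

```python
from ops.charm import CharmBase
from ops.main import main


class ReverseProxyCharm(CharmBase):
    """Hypothetical charm illustrating the idempotent, state-driven style."""

    def __init__(self, *args):
        super().__init__(*args)
        # One handler serves every event that could change the upstream
        # list; it always recomputes from the full model state.
        self.framework.observe(self.on.website_relation_changed, self._reconcile)
        self.framework.observe(self.on.website_relation_departed, self._reconcile)

    def _reconcile(self, event):
        # Walk every current "website" relation instead of appending only
        # the unit named in this event, so ordering and missed events
        # matter much less.
        upstreams = []
        for relation in self.model.relations["website"]:
            for unit in relation.units:
                address = relation.data[unit].get("ingress-address")
                if address:
                    upstreams.append(address)
        self._write_config(sorted(upstreams))

    def _write_config(self, upstreams):
        # Hypothetical helper: render configuration purely from
        # `upstreams`, so writing the same list twice is a harmless no-op.
        ...


if __name__ == "__main__":
    main(ReverseProxyCharm)
```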

But even with the improvements we expect from the new charming world, sometimes charms are legitimately going to error (or block). This isn’t necessarily due to a bug or deficiency in the charm; there is a point at which the charm just doesn’t know enough about the situation to resolve all problems itself. It requires knowledge beyond the single application and its relations: knowledge about the model’s higher level workload, and even the environment and circumstances in which this particular deployment of the workload operates. For some Juju use cases encoding this additional logic in code is important and the extra complexity and cost is worth it, while for others it’s either unsafe or too expensive.

For the api.snapcraft.io workload I want any database issue to wake me up within thirty seconds so I can get my ops team onto the problem, communicate with our most critical customers, and potentially shed load or fail over to another region; we have strict uptime requirements, skilled ops staff that know the service well and can analyse unanticipated failures better than the charm can, and enough load that doing the wrong thing can cause cascading failure. But if the blog.launchpad.net MySQL primary unit falls over, it’s probably okay for the service to be down for a couple of minutes while the charm automatically fails over; it doesn’t block millions of customers when it’s down, it doesn’t have a dedicated ops team, and automatically taking the wrong action isn’t likely to make the problem worse (at worst the blog is down for an hour while I restore the small DB from backups).

So I’m not sure solving these higher-level failures at the charm level is right, sensible, or even possible. There’s extra logic that needs to be encoded above charm level; almost like an operator for a bundle, a workload operator, an operator operator, or a meta-charm. Like a charm, it’s replacing a set of responsibilities which were classically held by a human operator. But it’s a very different level of decision making from application charms.

Some of these workload operators would be generic, and handle lots of situations themselves: I could configure WordPress to take a broken MySQL standby out of rotation because the load is low enough and I don’t want unnecessary model flux while I’m asleep; or I could configure it to respond to failure by making caching more aggressive, shedding search load, and immediately redeploying the database because I’m frequently at the top of Hacker News and need the capacity.

But for sites like api.snapcraft.io the workload operator is likely to hardcode a few simple rules: if an appserver unit fails, replace it; if a frontend fails and we don’t have enough TLS capacity left over, reduce search result sizes while we replace it; if the PostgreSQL primary fails, set the application to read-only and post an outage message on the developer status page; if anything else is wrong, fire off an event and let normal LMA infrastructure notify the right people.
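
Purely as an illustration of what those hardcoded rules might look like (no such workload operator exists in Juju today, and every name below is invented), the logic could be as plain as a small rule table:

```python
# Hypothetical workload-operator rule table; none of these conditions or
# actions correspond to real Juju features.
RULES = [
    ("appserver unit failed", "replace the unit"),
    ("frontend failed, low spare TLS capacity", "reduce search result sizes, then replace"),
    ("postgresql primary failed", "set the application read-only, post a status notice"),
    ("anything else", "emit an event for the LMA stack to route to humans"),
]


def respond(condition):
    """Return the configured response for an observed failure condition."""
    for rule, action in RULES:
        if rule in (condition, "anything else"):
            return action
```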

I’d definitely be interested to hear others’ opinions on these matters.


I believe that taking a very conservative stance on the consuming side of the relation would help alleviate the problem of cascading errors. For instance:

  1. Checking relation validity at every point where our consuming charm would reach out to the other side (sketched in code after this list);
  2. Robust exception handling, with quick circuit breaking when the relation no longer exists;
  3. Reducing the number of places where the consuming charm has to reach out to the other side for information (a tactic with more limited applicability).
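
Here is a minimal sketch of points 1 and 2 using the Operator Framework. The proxy relation name, the url key, and the _configure_datasource helper are all made up; the shape to note is the early returns that break the circuit when the relation or its data is missing, instead of letting an exception push the unit into an error state:

```python
from ops.charm import CharmBase
from ops.model import ActiveStatus, WaitingStatus


class GrafanaLikeCharm(CharmBase):
    """Hypothetical consumer that checks the relation before acting on it."""

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(
            self.on.proxy_relation_changed, self._on_proxy_changed
        )

    def _on_proxy_changed(self, event):
        relation = self.model.get_relation("proxy")
        if relation is None or event.app is None:
            # The other side is gone (perhaps force-removed): break the
            # circuit quietly instead of raising into an error state.
            return
        url = relation.data[event.app].get("url")
        if not url:
            # Data hasn't been published yet; wait for a later event
            # rather than reaching across the relation and failing.
            self.unit.status = WaitingStatus("waiting for proxy URL")
            return
        self._configure_datasource(url)  # hypothetical helper
        self.unit.status = ActiveStatus()

    def _configure_datasource(self, url):
        ...
```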

However, I think that no matter how hard we try to prevent these things from ever occurring, something will always fall through the cracks. So it’s just as important to put tooling around our code that lets us quickly re-create the problem on our workstations and then prevent future regressions. Robust, readable, maintainable automated tests need to be encouraged in every charm from the beginning, and well-structured, testable charm code follows from that.
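
For example, a regression test along these lines can recreate the “related, but the data never arrives” situation on a workstation with the Operator Framework’s test harness (the charm under test is the hypothetical GrafanaLikeCharm from the sketch above):

```python
from ops.testing import Harness

METADATA = """
name: grafana-like
requires:
  proxy:
    interface: http
"""


def test_missing_proxy_url_does_not_error():
    harness = Harness(GrafanaLikeCharm, meta=METADATA)
    harness.begin()
    # Relate to a proxy application that never publishes a URL.
    relation_id = harness.add_relation("proxy", "http-proxy")
    harness.add_relation_unit(relation_id, "http-proxy/0")
    harness.update_relation_data(relation_id, "http-proxy", {"note": "no url yet"})
    # The charm should wait, not raise and land in an error state.
    assert harness.charm.unit.status.name == "waiting"
```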

That’s a page I’m taking from my CI/CD experience. Another one that I’d like to put forward is the concept of blue-green deployments. Charms and the workload they manage should be viewed for what they are: SaaS applications. So it may serve them well to adopt SaaS-oriented CI/CD deployment strategies, similar to the Staging and Release phases of this diagram.

Likewise, the charm and the workload itself must codify this reality. For instance, they should always aim to migrate the underlying database schema (if any) in a backwards-compatible way, so that the currently running version can still use it while the new version works its way towards full deployment.
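
As a hedged illustration of that constraint (the table and column names are invented): an additive change leaves the running version untouched, while a rename breaks it immediately.

```python
# Additive: the currently running version simply never reads the new
# column, so both versions can share the schema during the rollout.
ADDITIVE_MIGRATION = "ALTER TABLE articles ADD COLUMN summary TEXT"

# Breaking: the currently running version still queries `body`, so its
# requests start failing the moment this migration runs.
BREAKING_MIGRATION = "ALTER TABLE articles RENAME COLUMN body TO content"
```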

I agree that charms should be written defensively, but that is true of any code project you work on. (Validate the input parameters before reacting to them.) I’m actually rather surprised that an error in one charm would cause an error in another. If one application goes into an error state, then it doesn’t set any relation data for the other application to read, which seems a much safer position than having half of the data written because the charm aborted halfway through.

In my mind, this isn’t any different from deciding what to do when you pass an integer to a function that expects a string. Should that function cast every argument passed to it to a str? Most of the time that is actually the wrong choice, and it causes unpredictable behavior that is harder to debug than a clear error.
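
A tiny illustration of that analogy (plain Python, not charm code): failing clearly at the boundary keeps the traceback next to the real culprit, whereas silently casting pushes the surprise somewhere far downstream.

```python
def set_hostname(hostname):
    """Accept only a string; refuse to guess what the caller meant."""
    if not isinstance(hostname, str):
        # A clear, immediate error points straight at the bad caller.
        raise TypeError(f"hostname must be a str, got {type(hostname).__name__}")
    return hostname.strip().lower()


# set_hostname(8080) raises TypeError here, rather than quietly becoming
# the string "8080" and failing somewhere three layers away.
```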

The problem with silently ignoring issues is that a small issue that isn’t caught ends up slowly making things worse, and then you get the same sort of cascading failures; it’s just much harder to figure out the source of the problem, because the actual problem is three steps removed from where you notice it.

I absolutely do understand that the person who developed the charm code is likely not the person running it in production. So there are times when WARNING is much more appropriate than failure. (Yes, there is a bug and that part is broken, but all the other parts can continue to operate.)

I know we have talked a bit in Juju about being able to run a charm in “strict” mode, so that you would get clear errors if you used deprecated methods, etc. It might be interesting to be able to hook into something like that and have a way to say “these are non-critical issues”. But if I’m running the charm as the developer, I want tracebacks and error states so that I immediately know what I need to fix.
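
A hedged sketch of how that might look from charm code; the strict-mode config option and this helper are hypothetical, not an existing Juju or Operator Framework feature.

```python
import logging

logger = logging.getLogger(__name__)


def report_non_critical(charm, message):
    """Warn in normal operation; fail loudly when running in strict mode."""
    if charm.config.get("strict-mode", False):
        # Developer mode: a traceback and an error state make the
        # problem impossible to miss.
        raise RuntimeError(message)
    # Operator mode: record the problem but keep the rest of the charm
    # operating.
    logger.warning(message)
```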

Certainly in Juju code itself we follow a pattern of “don’t put panic() in production code”, but I think a “don’t raise exceptions” rule would surprise people writing in Python.

Actually, I don’t mean erroneous data passed over charm relations. The problem was that a charm ran into an internal charm error and I had to force remove it; the force removal of one charm is what broke the other charm.

I did get a good point from @thumper on that, though.