Per Yuan et al., Based On A Study Of Five Distributed Software Systems, These Three Code Smells Cause 35% Of Catastrophic Outages

Ayo Ijidakinro
6 min read · Oct 10, 2023

I manage software engineering for an application used at 40,000+ locations across the United States and Canada.

An outage can cost thousands, even millions.

Therefore, I review research papers and books to get hard data to pass on to my engineering team to boost code stability and reliability.

Buried in a 2014 paper, presented at the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14), is a surprising statistic.

Researchers Ding Yuan, Yu Luo, et al. randomly sampled 198 user-reported failures, 48 of which were catastrophic, and found the following.

For the five distributed applications studied, 35% of catastrophic system failures resulted from three easily identifiable and correctable code smells.

Even better…

You can detect these three issues via code review and static analyzers

The paper states that, for the catastrophic failures among the 198 randomly selected failures reviewed:

35% of the catastrophic failures are caused by trivial mistakes in error handling logic — ones that simply violate best programming practices; and that can be detected without system specific knowledge.

For reference, the paper is titled: Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems.

The distributed applications studied were: Cassandra DB, HBase, Hadoop Distributed File System, Hadoop MapReduce, and Redis.

All three of these issues relate to CATCH blocks in try-catch statements

In their research, they found that for these well-tested distributed applications, “92% of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software”.

These popular distributed systems are well tested.

So, the solution is not as simple as adding extra unit tests.

But this research paper shows that we can find these errors during code reviews and via static analyzers.

To do so, our teams need to know to watch for these code smells.

The heuristic they discovered is to pay attention to catch blocks. Most of the failures came from poor handling of non-fatal failure modes.

I myself have seen two of these issues first-hand.

Below are the three code smells from the paper.

Code Smell #1: The log-only error handler

The first case is having an error handler that logs the error, but does nothing else.

25% of catastrophic failures were caused by this case alone.

try {
    ...
}
catch (...) {
    Logger.Warn(...); // the error is logged, then silently dropped
}

Whether this is done unintentionally or on purpose, Yuan et al. found that this type of catch-block can cause non-fatal issues to fester and culminate in application collapse.

The solution

Developers should take a hard look at all non-fatal error logging to see whether:

  1. Logging only was a mistake, and the developer meant to re-throw the exception
  2. The developer took the easier route of logging the issue rather than taking proactive steps toward run-time mitigation

If a catch block is empty, the above suggestions apply as well.

An error handler that solely logs is not always wrong.

But it should cause your spidey-sense to tingle.
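To make the second question concrete, here is a minimal sketch in Python (used only to keep the example runnable; the snippets in this article are C#-style pseudocode). The names `flush_log_only`, `flush_with_retry`, and `TransientError` are hypothetical, and retrying is just one possible mitigation, not something the paper prescribes.

```python
import logging
import time

logger = logging.getLogger("storage")


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def flush_log_only(flush):
    """The smell: the error is logged and then silently dropped."""
    try:
        flush()
    except TransientError as exc:
        logger.warning("flush failed: %s", exc)  # caller never learns it failed


def flush_with_retry(flush, attempts=3, delay_seconds=0.0):
    """One possible mitigation: retry a few times, then surface the error."""
    for attempt in range(1, attempts + 1):
        try:
            flush()
            return attempt  # number of attempts it took to succeed
        except TransientError as exc:
            logger.warning("flush attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # out of retries: let the caller decide what to do
            time.sleep(delay_seconds)
```

The second version still logs, but the failure is either recovered from or handed to the caller. It never quietly disappears.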

Code Smell #2: Re-throwing an exception on a generic exception handler

On the opposite end of the spectrum is the error handler that’s too broad.

This issue caused 8% of the catastrophic failures.

For example, consider code like the following:

try {
    ...
}
catch (Exception ex) {
    // Catching the base class Exception catches ALL exception types that inherit from Exception
    ...
    throw;
}

Such handlers abort an entire application when perhaps only one of many possible exception types is fatal.

For example, imagine that certain file handling code can encounter the below exceptions:

  • The disk is full
  • The disk is busy
  • The file is locked

Most likely, only one or two of those exceptions are fatal.

Also, each of those exceptions should likely have different error handling logic.

If a developer handles ALL exception types with a single catch block, non-fatal issues can trigger catastrophic failures, especially when the catch block naively re-throws.

A transient, easily recoverable problem could be treated with the same panic as a fatal one.

For example, a busy-disk exception could be treated with the same panic as a full disk.

The solution

Check to see if the developer has thought through the nuances of the different exception types a block of code may trigger.

Then, check that each exception type is handled appropriately.

Some exceptions may require a re-throw.

But, can some exception types be handled gracefully?

As always, exception handling should be as specific as possible.

Code reviews and static code analysis can easily spot this danger before the overly generic error handler makes it to production.
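As a rough illustration of handling each failure mode on its own terms, here is a minimal Python sketch of the disk example above. The function name `write_record` and the policy it encodes (retry a busy disk, skip a locked file, re-raise a full disk) are assumptions made for the example. The right policy depends on your system, and how a locked file surfaces varies by platform.

```python
import errno
import time


def write_record(write, retries=3, delay_seconds=0.0):
    """Call write(), treating each failure mode differently."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            write()
            return "written"
        except PermissionError:
            # Assumed policy: a locked or inaccessible file is skipped and reported.
            return "skipped"
        except OSError as exc:
            if exc.errno == errno.EBUSY and attempt < retries - 1:
                time.sleep(delay_seconds)  # transient: wait and try again
                continue
            # Disk full (ENOSPC), retries exhausted, or anything unexpected:
            # re-raise so the caller can stop or fail over.
            raise
```

Because `PermissionError` is a subclass of `OSError`, its handler has to come first. Otherwise the broader handler would swallow it, which is the very smell this section describes.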

Code Smell #3: The ‘To-Do’ Error Handler

Only 2% of failures were caused by this issue, but it is still worth mentioning.

As a developer builds a large feature, he may add comments like “FIXME” or “TODO” to remind himself to come back and address an issue.

Indeed, when following an Agile methodology, some production code may ship with “FIXME” and “TODO” still in place.

However, when statements like “FIXME” and “TODO” are in an exception catch-block, BE VIGILANT!

It could be that the programmer forgot to go back and add error handling.

Or, the developer may truly believe he can come back to this error handling later.

As you would expect, Yuan et al. found that these sorts of comments in error handling blocks become explosive landmines in production.

The solution

During code reviews, or via static analysis, scan the codebase for TODO-like comments in error handlers.

Verify that these are not an oversight or improper postponement by the developer.

Again, it’s too strict of a rule to say that code must ship with zero TODOs.

But, when a TODO is found in error handling code, you should be concerned.

A final bonus thought

Beyond the three smells above, the study reported another alarming statistic.

In 23% of catastrophic failures, error handling logic for a non-fatal error was flawed.

These flaws were so evident that basic statement coverage testing or more thorough code reviews would have caught them.

So, again, having error handlers in our code is excellent.

But, we need to pay close attention to the content of error handlers. We need to verify that error handling code itself is correct and robust.

This is one area where targeted automated tests could help.
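One way to target such tests is to inject the fault, so the error handler actually runs under test. Here is a minimal Python sketch using the standard library's `unittest.mock`. The names `read_with_fallback` and `ReplicaUnavailable` and the fallback behavior are hypothetical, chosen only to show the shape of such a test.

```python
from unittest import mock


class ReplicaUnavailable(Exception):
    """Hypothetical non-fatal error raised by a replica."""


def read_with_fallback(primary, secondary, key):
    """Read from the primary; on a non-fatal error, fall back to the secondary."""
    try:
        return primary.get(key)
    except ReplicaUnavailable:
        return secondary.get(key)


def test_fallback_path_is_exercised():
    # Force the primary to fail so the except block actually runs.
    primary = mock.Mock()
    primary.get.side_effect = ReplicaUnavailable("primary down")
    secondary = mock.Mock()
    secondary.get.return_value = "value-from-secondary"

    assert read_with_fallback(primary, secondary, "k") == "value-from-secondary"
    secondary.get.assert_called_once_with("k")
```

Without the injected failure, an ordinary happy-path test would never execute the handler, so a bug inside it could ship unnoticed.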

Conclusion

Yuan et al.’s research gives us three powerful code review guidelines that we can use to make sure our developers ship robust, reliable code.

Adding exception handling to our software is great. However, to make our systems robust, code reviews must pay close attention to the contents of exception handlers.

  1. Is the handler catching too broad an array of exceptions?
  2. Is the handler handling each exception type with the appropriate nuance?
  3. Is the handler unintentionally swallowing the exception?
  4. If a handler is intentionally swallowing an exception, does it attempt reasonable mitigation?

Finally, if available for your language, use a good static code analyzer that helps you find these three code smells.

Based on Yuan et al.’s findings for the five systems studied, these changes to your code reviews and static analyses could prevent up to 35% of catastrophic production failures!

Put these tips into action NOW to help your team release software that is more reliable and robust!

Failure to monitor these code smells will leave you with software that is less reliable, with a higher occurrence of incidents!

Let me know how these tips help your software dev team, or if you disagree with any of the findings from this paper.
