As this is the last part of our software quality introduction series, it is time to look at some real examples.
From the concept itself to the quality-fluent team, we have explored the topic from several angles.
That’s all very well, but even with good indicators, role-based features, and an efficient quality model it’s still additional work on top of coding.
So, is it really worth it?
Well, it is. Software quality is definitely not a myth.
And if you need convincing, let’s take a look at some real-life coding horror stories.
They’re everywhere!
These days coding issues regularly pop up in tech news, or even mainstream media.
From hardware and software vulnerability exploits to account lockouts, IoT glitches, and OTA mishaps.
We have somewhat come to accept bugs as an identified (even expected) part of software.
This is not such a big issue if the software is a video game or a bird-spotting app.
But as software finds its way into almost all human activities, sometimes bugs can be physically harmful.
Do we have these bugs because of increasing complexity?
Maybe dense technology layers are more subject to quality issues?
Not really.
Software quality: real examples
There is a reason why software quality is not a myth: it has always been an issue.
As soon as software was good enough to help us manage money, vehicles, and medical equipment, bugs could have impacts beyond the computer screen.
We learnt from our mistakes. Regrettably, sometimes at the cost of human lives.
Here are some very real examples showing grave effects of software malfunctions.
We’re not saying Software Quality would have prevented them all, but based on this experience, a quality model today could raise flags to avoid issues before they become dangerous.
Overflow issues
- 1991: Civilization video game
Ref: Civilization aggression “level bug”
Subtracting 2 from 1 in an 8-bit unsigned variable wrapped around to 255.
As a result, a peaceful character transformed into a warmonger.
- 1996: Ariane 5 rocket explosion
Ref: Disasters caused by computer arithmetic errors
The velocity value overflowed its 16-bit integer storage.
Extensive unit tests and dynamic code checkers may have detected the issue.
Test issues
- 1991: Patriot missile failing to intercept a Scud missile
Ref: Disasters caused by computer arithmetic errors
An accumulating precision-loss bug made the system's internal clock drift further the longer it ran.
Unit tests coupled with realistic test campaigns could have detected this issue.
- 1997: Pathfinder system reset
Refs: What really happened on Mars Rover Pathfinder, The Pathfinder Reboot Problem
A priority inversion occurred: a low-priority task held a mutex needed by a high-priority task while medium-priority tasks kept preempting it, triggering watchdog resets.
Dynamic code checking and testing might have helped.
- 1999: Mars Polar Lander crash
Refs: NASA mission, Wikipedia article
The descent engine cut off too early when a vibration from the landing legs deploying was misinterpreted as touchdown.
Expanding the testing scope could have helped detect this scenario.
- 2000: The infamous Y2K bug
Ref: Wikipedia article
Storing years as two digits causes bugs when dates span two centuries.
Comprehensive unit tests should have found this.
Correction: the Y2K bug was known long before it became the focus of the 1999 news cycle. In this case, tests were not needed to reveal the issue, but to determine whether its impacts were serious or not.
Quality checking issues
- 1985: Therac-25 software: massive radiation overdose
Refs: Computer magazine study, Wikipedia article
Several software issues and a lack of quality practices put patients in lethal danger. Among the causes were poor failure handling, no integration tests, unsafe reuse of legacy code, and arithmetic overflow.
- 1992: London Ambulance Service Computer Aided Dispatch System Failure
Refs: Wired article, Research paper
The dispatch software was released with known defects and insufficient load testing.
Quality assurance linked to configuration management would have helped monitor and detect issues.
- 2007: Toyota acceleration issue
Ref: Toyota Unintended Acceleration
A very large and complex codebase produced severe maintainability issues.
Experts found bit flips, disabled failsafes, memory corruption, overflows, and thousands of global variables.
A Software Quality model could have raised numerous alerts in this case.
- 2008: Heathrow T5 issues on opening day
Ref: British Airways reveals what went wrong with Terminal 5
Delayed and cancelled load tests, plus debug code left in the release version, caused massive baggage mishandling and flight cancellations.
Continuous quality monitoring helps avoid the “project deadline rush” and makes sure tests and code quality are checked early and often.
Neverending story?
These examples paint a rather dark picture, and the list of software malfunctions seems to grow in all areas.
So, if software quality is not a myth, are we doomed? Are we witnessing a “bug uprising”?
I don’t think so.
Today, we are at a point where Software Quality has expanded in several directions.
It helps us learn what not to do, and what to do better.
In a beautiful circle, software tools can help us specify, design, write, test, and report on our own software creations.
And the quality model, fed by the experience we have gained, can help us monitor quality, anticipate issues, and improve our maturity.
Software quality is therefore worth the effort: it builds knowledge to improve our code, and perhaps ourselves too 🙂
4 thoughts on “Software quality is not a myth: Real examples”
Good article, but we were completely aware of the Y2K bug and didn’t need comprehensive unit tests to tell us about it. It was a conscious decision to use 2 bytes instead of 4 to store the year because the cost of memory in those days was significant enough for this sort of optimization to be worth it. Y2K wasn’t a surprise, it was known about decades in advance. It’s sort of like climate change – we know it’s happening but until the water is actually outside the door we’re not going to do anything about it. It’s tomorrow’s problem and we have business constraints today.
Hi Andy, thank you for your comment.
I agree, the Y2K bug was a perfect storm of a known limitation, a lot of hype, and the inescapable end-of-millennium angst.
Fortunately, the bug didn’t always have serious impacts, and most sensitive software had been fixed in advance anyway.
It’s interesting how milestones keep appearing on our computing path: crossing the 640K memory limit, expanding the two-digit date encoding, developing massively parallel algorithms.
Wonders never cease! 🙂
Hmm. “That’s all very well, but even with good indicators, role-based features, and an efficient quality model it’s still additional work on top of coding.” Is it? Why? Why isn’t it part of coding, or do we expect to inspect the code after it’s written for non conformances? Surely a coder only delivers working code?
Hi.
I agree with you, and to be more precise, quality should not be work done after the code is delivered.
It’s like tasting your dish when it’s served: it might be possible to adjust the salt, but not everything can be fixed when you’re finished.
Working code is certainly one of the objectives; that’s where we fix the visible faults. As for the invisible faults, they should be detected as soon as possible, and quality models are there to help us do that, while it’s still easy to do.