Project: {{catchy_title}}

Opinion - Testing Miniseries - Part One - The (Formulated) Cost of Inadequate Testing

2020-11-21T00:00:00+00:00

WIP Disclaimer - All content is subject to change!

I’ve decided to break down my initial attempt at discussing The Cost of Inadequate Testing into a miniseries of articles - welcome to Part One.

Often I find that the value of testing is difficult to measure; however, the cost of inadequate testing is quantifiable.

There are many projects out there, all with constraints and caveats that make test strategies arguably unique. One might claim that common layers of the stack can be addressed in a common manner. For example, modern HTTP web servers often include a testing framework, and special runtime environments like Android or iOS have frameworks that provide conventional patterns to set up the environment and make assertions.

However, something that often goes unconsidered is the cost of the tests written and the value they provide.

Formulation and Terms

In this article I shall attempt to formulate the perceived cost to develop a Product over its entire lifecycle. Terms are in bold.

For this thought experiment, we’ll express costs in the units of “developer hours”, the trendier version of a “man hour”.

When we build a Product, we can think of it as a composition of an environment (Infrastructure), some business logic (Features), and undetected bugs + refactoring (Maintenance).

Product = Infrastructure + Features + Maintenance

Caveat: Infrastructure is a topic for another day. For this exercise we can pretend that if the code builds in the Continuous Integration pipeline, then it performs (as authored) in the production environment.

Maintenance encapsulates both updating the business logic as the Product (and our understanding of the business value) evolves, along with addressing defects as they are found. Let’s decompose Maintenance into Defects and Business Logic Refactoring, where we treat Business Logic Refactoring as a constant cost and therefore omit it from the later formulas to reduce complexity. (If you believe this to be heresy, feel free to email me and tell me why you think I’m wrong!)

Next, let’s zoom in on the Features Term. Each Feature needs to be described, designed, built, and updated as the product evolves. Easy enough!

Feature = Design + Implementation

Each of the above terms could be expanded further, but for now let’s unpack the Implementation term - Implementation is best described as writing the code, bringing the design into the world of the living.

Lastly, let’s assume the Product exists on a timeline, where any given point on the timeline can be referred to as T (or t) and Product(T) is the cost of the Product at time T.

And we’ll pretend we can deliver a new feature at each increment of T.

E is the sigma operation, so something like E(T=0, Now) {Foo(T)} describes the sum of Foo for every value of T from 0 to Now, inclusive. For example:

E(T=0, 4) {2 * T} = 0 + 2 + 4 + 6 + 8 = 20
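A minimal Python sketch of this notation, assuming (as the worked example suggests) that both bounds are inclusive. The name `sigma` is my own and only here for illustration:

```python
from typing import Callable


def sigma(start: int, end: int, term: Callable[[int], float]) -> float:
    """Sum term(T) for every integer T from start to end, inclusive."""
    return sum(term(t) for t in range(start, end + 1))


# The worked example: 0 + 2 + 4 + 6 + 8
print(sigma(0, 4, lambda t: 2 * t))  # prints 20
```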

The left side term will be the total cost of the Product, with each iteration of the formula named after a letter of the Greek alphabet.

$Alpha = Product(Now) = E(T=0, Now) {Infrastructure(T) + Feature(T) + Defects(T)}

Okay we’ve mostly defined the problem space, but we haven’t considered where Testing should go. Let’s add it to the formula by weighting each Term with a coefficient.

If we include sufficient Testing with each Feature, we can assume our Defects will be reduced by say 50% and our cost to develop each Feature is increased by 100%. (Some case studies of TDD claim the number of lines of code for tests and business logic is equal!)

$Beta = Product(Now) = E(T=0, Now) {Infrastructure(T) + Feature(T) * 2 + Defects(T) / 2 }

If we assume addressing Defects is significantly cheaper than developing Features and writing tests, $Beta would be significantly more expensive than $Alpha.

However, if we assume inadequate Testing is occurring, then we might reduce the Feature cost to the original in $Alpha and rely on our Customers and Support Tickets to determine when Defects were introduced to the system. We’ll adjust $Alpha to reflect this and rename it as $Gamma:

$Gamma = Product(Now) = E(T=0, Now) {Infrastructure(T) + Feature(T) + Defects(T) * 2 }

So far $Gamma looks like a cheaper way to develop software than $Beta! We should just never test until we find a Defect in the field!
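To make the comparison concrete, here is a rough Python sketch of the three models so far. The per-increment costs are made-up numbers, assumed constant over T, and chosen so that Defects are much cheaper than Features, as the argument above assumes:

```python
# Hypothetical per-increment costs in developer hours (assumed constant over T).
INFRASTRUCTURE = 1.0
FEATURE = 8.0
DEFECTS = 2.0


def total_cost(now: int, per_step_cost: float) -> float:
    """Sum a constant per-step cost for every T from 0 to now, inclusive."""
    return sum(per_step_cost for _ in range(now + 1))


def alpha(now: int) -> float:
    # Baseline: no Testing coefficient applied at all.
    return total_cost(now, INFRASTRUCTURE + FEATURE + DEFECTS)


def beta(now: int) -> float:
    # Sufficient Testing: Features cost double, Defects are halved.
    return total_cost(now, INFRASTRUCTURE + FEATURE * 2 + DEFECTS / 2)


def gamma(now: int) -> float:
    # Inadequate Testing: Features at the original cost, Defects doubled.
    return total_cost(now, INFRASTRUCTURE + FEATURE + DEFECTS * 2)


for name, model in (("Alpha", alpha), ("Beta", beta), ("Gamma", gamma)):
    print(name, model(10))
```

With these assumed numbers, $Gamma comes out below $Beta at Now = 10, which is exactly the tempting conclusion described above.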

However, we haven’t accounted for the compounding technical debt and increased development time due to an unstable and non-assertable codebase.

If we include the cost for context switching and debugging a misbehaving application that does not have a ground truth assertion to work with, a revised version of the formula might look like:

$Delta = Product(Now) = E(T=0, Now) {Infrastructure(T) + Feature(T) + Feature(T-1) / 2 + Defects(T) * 2}

Oh no!!! We’re paying more than full price for a Feature! And we’ve got twice the Defects!! It’s still cheaper than $Beta though, right? If we assume every Feature costs roughly the same, the formula simplifies to:

$Delta = Product(Now) = E(T=0, Now) {Infrastructure(T) + Feature(T) * 1.5 + Defects(T) * 2}

Lastly, let’s consider the situation where Features rely upon more than the previous Feature, and in the worst case Feature(T) relies upon Feature(0…T-1).

$Eta = Product(Now) = E(T=0, Now) {Infrastructure(T) + Feature(T) + E(t=0, T-1) {Feature(t) / 2} + Defects(T) * 2}

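Now we’re really in trouble: if each Feature costs about the same, the nested sum makes the total cost grow quadratically with the number of Features, instead of linearly as in the earlier models.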
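To get a feel for how quickly the nested sum in $Eta catches up with $Beta, here is a rough Python sketch. The constants are made-up developer hours, assumed constant over T, and the helper `first_crossover` is my own invention for this illustration:

```python
# Hypothetical per-increment costs in developer hours (assumed constant over T).
INFRASTRUCTURE = 1.0
FEATURE = 8.0
DEFECTS = 2.0


def beta(now: int) -> float:
    """Tested model: Features cost double, Defects are halved."""
    return sum(INFRASTRUCTURE + FEATURE * 2 + DEFECTS / 2 for _ in range(now + 1))


def eta(now: int) -> float:
    """Untested worst case: every earlier Feature adds half its cost again."""
    total = 0.0
    for big_t in range(now + 1):
        # Inner sum over t = 0 .. T-1; it is empty when T == 0.
        rework = sum(FEATURE / 2 for _ in range(big_t))
        total += INFRASTRUCTURE + FEATURE + rework + DEFECTS * 2
    return total


def first_crossover(limit: int = 1000):
    """Smallest Now at which eta exceeds beta, or None if none is found."""
    for now in range(limit + 1):
        if eta(now) > beta(now):
            return now
    return None


print(first_crossover())  # prints 3 with the constants above
```

The exact crossover point depends entirely on the assumed constants. The shape is the part that matters: the rework term in `eta` keeps growing with T, while every step of `beta` costs the same.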

We could add more coefficients to account for the introduced Defects found while maintaining the previous Features, but I think the point has already been made: cost model $Beta is cheaper in the long run.

With cost model $Eta, the technical debt will quickly overwhelm us and soon we won’t be able to make the minimum payment and our Feature development will grind to a halt.

The crux of the problem is that some teams treat testing as an afterthought. The code is written, blessed, shipped, and tested. In that order, which only seems cost effective if you don’t look at the entire picture of the Product over its lifecycle.

Conclusion

My claim is that the true cost to develop the Product is cheaper if Testing is done when the Feature is written. Defects never go away for free, which means you’ll either be catching them while the Feature is in context or after it has been shipped and new Features are added to the dependency graph.

I’ve witnessed laborious human-based rituals performed on release candidates by teams of developers taking days (and even in some cases weeks) such that it may be proclaimed from the mountain tops to be defect-free. I will neglect to detail the frequency at which I’ve witnessed these rituals of sacred tribal knowledge be frantically repeated because a defect was introduced during the previous iteration of the ritual.

When one steps back, one must wonder “surely there is a better way, this can’t be the right way, can it?” All we really want at the end of the day is to have confidence in our code and to avoid being called to address an outage in production on Friday at 7PM after having a beer.

In the case that my example of rituals and proclamations is too remote to identify with, I would challenge the reader to recall a time when they were writing code (whether it be fixing a bug, adding a new feature, or just messing around) and they had to conduct some arcane process to assert that the characters they just typed didn’t cause the product to implode.

That feeling of discontent, of inconvenience, is the subtle cost of inadequate testing.

In Part Two of this series I’ll discuss ways to determine the value of a test, where to start on both new and not-new projects, and will provide a practical example of TDD in action.

If you found this post helpful or would like to fuel my caffeine addiction, consider donating.

Opinion - The Cost of Inadequate Testing

2020-11-12T00:00:00+00:00

WIP Disclaimer - All content is subject to change!

# The Cost of Inadequate Testing

Often I find that the value of testing is difficult to measure; however, the cost of inadequate testing is quite pronounced.

There are many projects out there, all with constraints and caveats that make test strategies arguably unique. One might claim that common layers of the stack can be addressed in a common manner. For example, modern HTTP web servers often include a testing framework, and special runtime environments like Android or iOS have frameworks that provide conventional patterns to set up the environment and make assertions.

However, something that often goes unconsidered is the cost of the tests written and the value they provide.

For instance, one might invest a substantial amount of time writing extensive tests to ensure line (and branch) coverage of a given component, only to find bugs appear in production in adjacent components, or worse, the component is refactored and the tests become obsolete.

In this situation, someone like Kent Beck (author of Test-Driven Development by Example) would argue that the tests were flawed from the start because changes in implementation details should not affect the tests. I suspect he'd go on to say that the production bugs would have been caught if time had been spent adding more contract tests to the adjacent components instead of testing implementation details, or if TDD had been used from the start.

However, it seems the other extreme is more common - testing as an afterthought. The code is written, blessed, shipped, and tested. In that order.

I've witnessed laborious human-based rituals performed on release candidates by teams of developers taking days (and even in some cases weeks) such that it may be proclaimed from the mountain tops to be bug-free. I will neglect to detail the frequency at which I've witnessed these rituals of sacred tribal knowledge be frantically repeated because a bug was introduced during the previous iteration of the ritual.

When one steps back, one must wonder "surely there is a better way, this can't be _the right way_, can it?" All we really want at the end of the day is to have confidence in our code and to avoid being called to address an outage in production on Friday at 7PM after having a beer.

In the case that my example of rituals and proclamations is too remote to identify with, I would challenge the reader to recall a time when they were writing code (whether it be fixing a bug, adding a new feature, or just messing around) and they had to conduct some arcane process to assert that the characters they just typed didn't cause the product to implode.

That feeling of discontent, of inconvenience, is the subtle cost of inadequate testing.

## Risk

An old adage in software goes something like "make the common case fast", however it's probably worth adding "and stable". One could interpret this amendment in a few different ways.

### Only test the code that gets called 80% of the time

One way of assessing risk is understanding the common code paths that your product uses and writing tests to cover them. Writing Sock Shop? Then unit / integration / smoke / UI tests should ensure that the user can add socks to their cart.

### Only test the happy path

When writing a new interface, it's the perfect opportunity to add happy path tests to assert that the defined contract is satisfied by the implementing classes. If things like index errors are never really encountered because you're using functional programming techniques, then maybe that IndexOutOfBounds test isn't very valuable.
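As a rough sketch of what a happy path contract test might look like, borrowing the Sock Shop cart from earlier: the `Cart` interface, the `InMemoryCart` class, and the `check_happy_path` helper are all hypothetical names made up for this example.

```python
from abc import ABC, abstractmethod


class Cart(ABC):
    """Hypothetical contract for a Sock Shop cart."""

    @abstractmethod
    def add(self, sku: str, quantity: int) -> None: ...

    @abstractmethod
    def count(self, sku: str) -> int: ...


class InMemoryCart(Cart):
    """One possible implementation, backed by a plain dictionary."""

    def __init__(self) -> None:
        self._items = {}

    def add(self, sku: str, quantity: int) -> None:
        self._items[sku] = self._items.get(sku, 0) + quantity

    def count(self, sku: str) -> int:
        return self._items.get(sku, 0)


def check_happy_path(cart: Cart) -> None:
    """Contract check that any Cart implementation should pass."""
    cart.add("red-socks", 2)
    cart.add("red-socks", 1)
    assert cart.count("red-socks") == 3
    assert cart.count("blue-socks") == 0


# Run the same check against every implementation of the contract.
for implementation in (InMemoryCart,):
    check_happy_path(implementation())
```

The check only talks to the interface, so a second implementation (say, one backed by a database) could be added to the tuple without touching the test.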

### Keep the dev loop __fast__

I prefer this interpretation to the others because it promotes the notion that confidence in one's product should be assertable at a moment's notice. Imagine being on a tech support call with a customer and DM'ing a developer to ask "are null values allowed in the XYZ table?" and the developer responds with, "brb I need to deploy the product to staging to find out". Scenarios like this devalue the time of the developer, degrade the trust of the customer, and are expensive in terms of opportunity cost for the company.

I believe, as a developer, the less time I spend mind-numbingly pulling levers and turning dials, the better.

Some may claim, "But Grayson! My situation is different! My product runs in an environment that is non-conducive to testing and I can't afford to invest in building a custom test harness!" This line of thinking fails to consider the long term maintenance costs of the product and the human nature of developers. If they must conduct a monotonous sequence of actions in order to assert the quality of the features, one will soon find that cost-estimates for features now must account for not only the manual cost of testing the old features __but also__ the cost of manually testing the new features. By not investing in a test harness, one exposes themselves to substantial risk of linearly increased development and maintenance costs. Oh, and developer burnout. Do you _really_ want to type that default password into the login page? Or click through that first-time-experience dialog?

It's worth noting that development costs may not be affected if engineers find the manual testing strategy too laborious and omit it from the dev loop entirely. One may find the total cost has not changed, but rather the development cost has merely shifted to the maintenance column due to the increased rate of production bugs introduced when untested code gets shipped.

Lastly, the risk of shipping test-able bugs is proportional to the number of manual test sequences, meaning the longer one waits - the worse it gets.
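A back-of-the-envelope Python sketch of that linear growth, assuming (purely for illustration) that every feature needs one manual test sequence of a fixed length and that each release adds one feature:

```python
MINUTES_PER_MANUAL_CHECK = 15  # hypothetical cost of one manual test sequence


def manual_cost_for_release(feature_count: int) -> int:
    """Minutes spent re-checking every existing feature by hand for one release."""
    return feature_count * MINUTES_PER_MANUAL_CHECK


def cumulative_manual_cost(releases: int) -> int:
    """Total minutes if each release adds one feature and re-checks them all."""
    return sum(manual_cost_for_release(n) for n in range(1, releases + 1))


print(manual_cost_for_release(10))  # prints 150
print(cumulative_manual_cost(10))   # prints 825
```

The per-release cost grows linearly with the number of features, and the running total grows faster still, which is one way to read "the longer one waits - the worse it gets".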

## Creating New Things

Though old habits and lethargy sometimes make it challenging, I find myself often attempting to write tests _before_ writing the feature code. Yes, yes, Kent would be proud that I'm drinking the punch. However, I would invite you to consider the situation of "The Unwritten Interface":

There are moments when building a product that a developer gets to create something new, to conjure something from thin air. In the land of test cases, the canvas is blank, the palette undefined, and the brush in hand. In this realm, the developer is free to dream and to unleash their abundant creativity in order to craft something worthy of sacrificing at the altar of all that is good and beautiful.

There's a small caveat - whatever is good and beautiful may not compile, but for now that's okay.

Techniques like Test-Driven Development (TDD) give the developer room to run and experiment with what feels and looks right. Unconstrained by dependencies, control flow, or the call stack.

In order to get things to compile, whatever hasn't been defined can be stubbed and dependencies can be mocked. This freedom allows one to broaden their perspective of the problem in order to arrive at a solution without being burdened by the blinders of convention.

Once the solution is found and the test result turns green, one can start to replace mocks and stubs with real implementations in an incremental fashion such that the nature of the interface is preserved.

Then once the base case is working, they may move on to exercising the interface further, adding test cases and refactoring the design as appropriate.
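A small Python sketch of that flow, using the standard library's `unittest.mock`. The `checkout` function and the payment gateway are hypothetical stand-ins invented for this example; in a real TDD loop the test would be written first and `checkout` filled in afterwards:

```python
from unittest.mock import Mock


def checkout(cart_total: float, gateway) -> str:
    """Hypothetical feature under test: charge the gateway, return a receipt id."""
    if cart_total <= 0:
        raise ValueError("nothing to charge")
    return gateway.charge(cart_total)


def test_checkout_charges_the_gateway() -> None:
    # The real gateway does not exist yet, so a mock stands in for it.
    gateway = Mock()
    gateway.charge.return_value = "receipt-1"

    assert checkout(12.5, gateway) == "receipt-1"
    gateway.charge.assert_called_once_with(12.5)


test_checkout_charges_the_gateway()
```

Once this is green, the mock can be swapped for a real gateway implementation without changing what the test asserts about the interface.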

In comparison, one could also design the interface _in the codebase_ in the land of _what is_ and _what works_. However, they run the risk of over-fitting the solution to the problem, and their time spent designing may be doubled if they fail to make the interface testable and it therefore needs a refactor when the tests are eventually written.

## Conclusion

Though the upfront cost of testing a product may seem expensive, it is important to consider the installments paid every time _that one feature_ needs to be worked on or a bug is found in production shortly after releasing. Having confidence in the product and fast dev loops is sometimes worth the overhead of investing in tests and test-driven development.

_If it means there's a better chance one won't get called at 7PM on a Friday after having a beer, it's worth trying, right?_

## Further Reading

- Test-Driven Development By Example - Kent Beck

If you found this post helpful or would like to fuel my caffeine addiction, [consider donating.](https://ko-fi.com/wghilliard)

Book Review - The DevOps Handbook

2020-01-12T00:00:00+00:00

WIP Disclaimer - All content is subject to change!

At one point in time, you could have asked me, “Grayson, how would you measure quality of a software product?” and I would have most likely answered with, “By how fast / accurate the code is!”

After a few years of working in various industry environments, I have become familiar with ways in which a team might approach developing software. Sometimes they focus on getting the MVP completed as fast as possible, sometimes they want an MVP as cheap as possible, and sometimes they want an MVP as good as possible. However, it never occurred to me that “good” was really a component of cheap AND fast. It wasn’t until I began reading “The DevOps Handbook” that I realized that “good” is often thought of as turning the “quality” dial up to 11, where some algorithm performs faster or the product is more feature complete. However, I seldom remember being asked the question “How do you know what you’ve built is good?”

Such a question could be interpreted as an accusation in some settings, with one’s rapport being challenged. In other settings it might be simply answered with slides and graphs depicting the product’s performance over previous iterations or implementations. However, another interpretation would consider how trustworthy the product is, an incredibly important attribute that I now believe is wildly underrated.

The DevOps Handbook could be potentially summarized with the following sentence:

One cannot claim quality unless it is regularly asserted.

These assertions can be executed in multiple ways:

- Automated Testing
- Production Telemetry

Automated Testing

Automated testing allows an individual to claim with authority that the thing they have built is trustworthy and has quality. An argument could be made that automated tests could be written in such a manner that they are effectively meaningless, but I would challenge the reader, for the sake of this post, to assume that an individual that was practicing this methodology put in the necessary effort to write meaningful and complete < unit | integration | end-to-end > tests. (1)

The DevOps Handbook claims that such automated testing allows for value streams to produce meaningful quality-metrics and gives teams a litmus test to determine if their development process is safe. These tests would largely depend on the domain and architecture of the product in question, but if done correctly they decrease risk when releasing the product by revealing bugs early on, facilitate higher functioning workflows (measured by lead time), and give external (and internal) customers faith that the product will work as intended when it is delivered.

Some environments see an investment in automated testing as a nice-to-have, almost as if it were a feature to be implemented in the next major release. However, this perspective trivializes the costs associated with last minute refactoring, emergency hot fixes, and most importantly - brand degradation. A deficit of automated testing leads to the product accruing a compounding mountain of technical debt resulting in a codebase so fragile and mystifying that even senior engineers might wonder how the product even worked in the first place. Spooky. (2)

So, enough with the war stories, let’s pretend we’ve got automated testing! Depending on the VCS flow the team is using, “quality gates” can be constructed that give the team (and all other downstream consumers) faith that the product will do everything it claims to do. (3) These gates can be arranged such that the intensity and scope increase with each level as described below:

| Gate Number | Examples | Run Time |
| --- | --- | --- |
| 1 | type checking, linting | milliseconds to seconds |
| 2 | unit tests - known IO, light mocking, accurate | seconds to minutes |
| 3 | integration tests - module composition, heavy mocking, behavioral | minutes to hours |
| 4 | end-to-end tests - no mocking, “real” services, api or ui driven, small data sets | minutes to hours |
| 5 | performance tests - similar to end-to-end but with larger data sets | minutes to hours |
| 6 | manual testing - human + checklist driven | minutes (4) |

The above gates are just a guideline, it may make sense to mix and match for a given product due to the domain, team, technologies, etc.
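As a rough Python sketch of how such gates might be chained, cheapest first, stopping at the first failure: the `run_gates` helper and the lambda checks are hypothetical placeholders, and a real pipeline would shell out to a linter, a test runner, and so on.

```python
from typing import Callable, List, Optional, Tuple

Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> Optional[str]:
    """Run gates in order; return the name of the first failing gate, or None."""
    for name, check in gates:
        if not check():
            return name
    return None


# Hypothetical checks standing in for real tools.
gates: List[Gate] = [
    ("type checking and linting", lambda: True),
    ("unit tests", lambda: True),
    ("integration tests", lambda: False),
    ("end-to-end tests", lambda: True),
]

print(run_gates(gates))  # prints: integration tests
```

Ordering the gates by run time means the slow, expensive ones only run once the fast ones have passed.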

Production Telemetry

Production telemetry helps answer the question “Is what we delivered providing value to the customer?” Without a mechanism to determine if a feature is providing value to the customer in an objective fashion, developers (and UX, QA, etc.) are left with anecdotes to describe the customer’s satisfaction or dissatisfaction. Naturally, only the extremes of the feedback are reported and development is guided by the outliers. Should time and energy be allocated to rebuilding / improving a certain feature? How do we know that the investment will provide value if we have no way to determine if our customers find that feature valuable? Moreover, the team is not empowered to experiment because it is arguably impossible to determine if a given change will improve or disrupt a customer’s workflow within the product. Therefore, instrumenting the product will give the development team visibility into how the product is being used, and will objectively assert whether or not value is being provided. (5)
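At its very simplest, instrumentation can be a counter per feature. The Python sketch below is only an in-memory illustration with a made-up `record_usage` hook; a real product would ship these events to a telemetry backend rather than hold them in a dictionary.

```python
from collections import Counter

usage = Counter()


def record_usage(feature: str) -> None:
    """Hypothetical instrumentation hook: count one use of a feature."""
    usage[feature] += 1


for event in ("search", "export", "search", "search"):
    record_usage(event)

print(usage.most_common(1))  # prints [('search', 3)]
```

Even a crude count like this replaces "I heard a customer complain about export" with a number the team can compare before and after a change.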

Some examples of frameworks that facilitate telemetry include:

Footnotes:

(1) Some real-world architectures and brown-field projects make certain types of testing a seemingly insurmountable task.
(2) If there are no tests to assert quality / desired behavior and a “bug” is found, is it really a bug?
(3) Or at least what the tests claim the product can do.
(4) Ideally human testing is reserved for edge cases and difficult to reproduce scenarios, but this isn’t always the case.
(5) Sometimes it’s not possible to derive a metric from aesthetic things, e.g. font or page layout. In this case A/B testing can be used to help answer questions about qualitative features.