Oliver Gilan

Software Crisis

A couple of weeks ago every flight in the US was grounded by the meltdown of the FAA’s NOTAM system, the latest large-scale example of the software crisis. The term was coined in 1968 to describe how the rapidly growing power of computers, and the use cases that came with it, had outpaced the industry’s ability to produce software of adequate quality. Computers became exponentially cheaper and more powerful, and so the role of software grew, but most of the software created at the time shared the characteristics that defined the crisis:

  • Over budget, behind schedule
  • Frequently never completed
  • Poor performance
  • Difficult to maintain
  • Did not meet the requirements of the end-user

Over the next decades some of the brightest minds in the field set about solving the crisis, and out of those efforts we got a number of new languages, programming paradigms, philosophies, and methodologies. Procedural programming and C brought a paradigm shift in the way programmers reason about and build software. Later, Object-Oriented Programming and languages such as Java did the same. Methodologies like waterfall, spiral, rapid, incremental, continuous integration, etc. started popping up, all with the aim of taming the complexity and solving the crisis, or at least a part of it. Then in February 2001 the Agile Manifesto was released and took the industry by storm. Today every software team, from the smallest startup to the largest enterprise, proudly waves the flag of the Agile development process.

And today, two decades after the birth of Agile and with further developments such as BDD, TDD, DDD, and every other acronym you can think of, the characteristics of software that marked the crisis are as prevalent as ever. NOTAM melted down because a contractor introduced incorrect data; more precisely, because the system let him introduce that data. Over a million organizations use Microsoft Teams, a basic chat app that constantly crashes, devours memory, and straight up fails to work on many machines. Twitch, a streaming platform with 31 million users, takes upwards of 10 seconds to render on my 6-year-old MacBook. Companies (and the NSA) routinely expose customer data by forgetting to add authentication to their public AWS S3 buckets. Outdated software keeps people in prison when legally they should be free. Browsers have a never-ending appetite for RAM, and now desktop apps built on those browser engines have adopted a similar palate. New front-end projects using the most popular frameworks are saddled with dependencies for dependencies for dependencies, each one increasing the surface area for attack. Peek into the average enterprise codebase and you’ll find hundreds of lines of boilerplate code and unnecessary abstractions (factories, providers, and complex inheritance schemes), along with dependency injection and a variety of other patterns that complicate the code so that a useless unit test becomes a little more sane to write. Kubernetes and microservices, designed to decouple independent pieces of software, now add latency, orchestration complexity, and interface coupling across multiple runtimes and software teams. Large-scale government software systems routinely experience outages, data leakage, and exploitation, and the default state for new government software projects is failure. The crisis is ongoing.

The Standish Group studies the state of software development in the industry every year and publishes its findings in a Chaos Report, with data going back to 1994. Agile dominates the industry as the primary methodology employed in most software projects, yet success rates are still dismally low.

[Table: Software project outcomes by project size]

Read the full 2015 report for definitions of project size and resolution status. When the project size is small, success rates are high and only a few projects fail. As projects increase in size the success rates fall, with a 41% drop just from a small to a moderately sized project. We built solutions to the problems software development faced in the 20th century, but those solutions were outpaced by the requirements of software as it ate the world. The outcomes in the Chaos Report read like an O(n²) algorithm, and our efforts to solve the crisis are being dwarfed by the exponential growth in the importance and scale of software.

In retrospect OOP is mostly a bad idea for software at scale (that’s for another time), and it’s unsurprising that Agile didn’t overcome that to solve the crisis. The original manifesto is only 12 principles and 68 words. There’s too much room for interpretation and not enough specifics on how it should be applied in various software development scenarios. In a sense, dedicating effort to solving the crisis mirrors the problem most teams face with tech debt: you don’t need perfect software to be successful. You can build the most technically sound application, but it won’t generate any revenue if it doesn’t actually solve a problem and add value for customers. Conversely, you can build wildly successful products with horrific code because the bar is so low, and people will choose something imperfect that solves their problem over something perfect that doesn’t. The startup graveyard is littered with technically sound products that lost because the technically unsound products actually shipped.

The problem is made worse by the concentration of software engineering talent. As software eats the world, its importance is growing in every industry, from agriculture to manufacturing to defense to logistics to government elections and beyond, but the most talented software engineers are found in companies that focus exclusively on the world of bits, predominantly the big tech companies. It makes sense for engineers to aim for those companies: setting aside the pay, lifestyle, and status (all of which are significant, by the way), those big companies were once small companies pushing the boundaries of technology and revolutionizing the world with new search engines, operating systems, social graphs, etc. Now a top engineer going to work for Google will put their talents to far less use than if they modernized the software used in global shipping instead. Startups move the needle on this problem of talent concentration, but they do little to address the hundreds of legacy companies and industries that affect our everyday lives and need software talent. There may not even be enough software talent to meet the needs of all those companies, and any solution to the software crisis will have to address that.

Progress is being made. Firebase, Supabase, and PocketBase bring backends and databases to the world of front-end devs, and full-stack frameworks like Next.js, SvelteKit, Remix, etc. make it easy to build and manage an application of decent scale with just a couple of engineers. User-friendly cloud platforms like Heroku, DigitalOcean, Render, Fly.io, and others have removed the deployment complexities of small projects so teams can focus more on the product and less on the infrastructure. The tools to build software at a small scale are better than ever, catalyzing an explosion in the number of individuals and small teams building profitable products. These improvements, however, are confined to small-scale, software-only products aimed at solving specific tasks, and they have not affected the outcomes of software at scale. The scale and requirements of the software needed in enterprise and nation-scale systems are a whole different beast, and gains in small-scale software development haven’t translated over.

If 90% of bridges collapsed before they were finished, the world would look very different than it does today, yet 91% of medium-sized software projects fail and our industry treats this as normal. Just as competent engineering teams should dedicate time to addressing technical debt, so too should our industry re-acknowledge the ongoing crisis and dedicate effort to solving it. We need solutions that allow for consistent success in developing good software at scale in our most important industries. That might start with an industry-wide definition of what good software even means. I have no big answers, but I have conviction that this problem of software development can be solved. There won’t be a silver bullet that addresses software creation in every scenario, but the majority of software systems can surely be classified into a handful of categories, with best practices and methodologies for each. Likely the methodologies that work best for each scenario already exist, and what’s required is the synthesis of these ideas into frameworks for management, engineering, and design, along with the subsequent education of and adoption by the industry at large.