The Lufthansa IT meltdown on Feb 15th, the FAA outage on Jan 11th and Southwest's planning software woes over Christmas have one thing in common: they were caused by mismanagement of technical debt and by poor incident management and disaster recovery practices. I bet that one day of flight cancellations cost Lufthansa a good fraction of its IT budget.
Try searching LinkedIn for "technical debt" and you will see plenty of posts highlighting the problem but not offering a solution. Keep reading if you are a CxO and feel defenceless when presented with the daunting question: "Should you develop a new digital product or reduce your technical debt first?"
Your IT or R&D Head will rarely explain what is behind technical debt, which brings ambiguity into decision making.
What is technical debt and why is it dangerous?
Technical debt is the cost (human, infrastructure, licensing, etc.) of supporting a working system or product, incurred during the maintenance or sunset stages (Waterfall Model). Here are some examples [1]:
Engineering time spent on moving from obsolete software libraries or hardware platforms to up-to-date systems (often done as part of refactoring);
Software or infrastructure upgrades that make a system more cost, resource or energy efficient, or more maintainable (i.e. reduce the R&D time spent on maintenance);
Time spent on improving existing security features or protocols;
Efforts to improve product reliability or availability, including action items following from past service outages or incidents;
Addressing deficiencies introduced by technical tradeoffs made by predecessors.
Digital Natives, including Big Tech, face the challenge of balancing resources between managing technical debt and developing new features and products on a daily basis. Not only have these companies found a solution that works, but there are also publicly available materials on how to apply their learnings. There are three key components to the practice:
1. Measure technical debt and its impact
SLAs, SLOs and SLIs (Service Level Agreements, Objectives and Indicators) are similar to business KPIs but, instead of measuring key business success metrics, they measure user experience, service health, reliability and maintainability. IT and R&D teams usually stop at "uptime", ignoring other indicators and failing to connect them to business needs [2]. Riccardo Carlesso authored a great workshop called "The Art of SLOs" that will help you.
Taking airlines as an example: measure the rate of manual rescheduling (i.e. how often automatically generated schedules had to be manually adjusted) or parameters that signal a system may cause an outage (as in the case of Lufthansa), such as time to release or the number of rollbacks due to critical bugs.
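To make the airline example concrete, here is a minimal sketch of how a "manual rescheduling" indicator could be computed and compared with an objective. The SLO target, the event counts and all names (SLO, sli, meets_slo) are hypothetical and chosen only for illustration; they do not come from any airline or from the workshop mentioned above.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A Service Level Objective: a target for an indicator (SLI)."""
    name: str
    target: float  # minimum acceptable fraction of "good" events, e.g. 0.98


def sli(good_events: int, total_events: int) -> float:
    """Service Level Indicator: the fraction of events that went well."""
    if total_events == 0:
        return 1.0  # nothing happened in the window, so nothing went wrong
    return good_events / total_events


def meets_slo(slo: SLO, good_events: int, total_events: int) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli(good_events, total_events) >= slo.target


# Hypothetical month: 2000 schedules generated, 60 needed manual adjustment.
schedules_generated = 2000
schedules_manually_adjusted = 60
schedules_ok = schedules_generated - schedules_manually_adjusted

slo = SLO(name="schedules needing no manual adjustment", target=0.98)
measured = sli(schedules_ok, schedules_generated)

print(f"{slo.name}: measured {measured:.1%}, target {slo.target:.0%}")
print("SLO met" if meets_slo(slo, schedules_ok, schedules_generated) else "SLO missed")
```

The point of the sketch is that the indicator is phrased in business terms (schedules that planners had to fix by hand), not in terms of server uptime.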
2. Implement a tradeoff policy (error budget)
Digital Natives and Big Tech use the concept of an "error budget" to balance maintenance and innovation. The budget can be defined in an SLA/SLO/SLI "currency" or proxy equivalents and is "charged" when a violation occurs. Should the budget be used up within a defined period of time (e.g. a week or a month), innovation stops and all available resources are put on maintenance until the SLAs/SLOs/SLIs improve [3]. With such a policy in place, Southwest's IT department would have seen the deficiencies in the planning software earlier and, most importantly, would have had a great ROI story for improving it.
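The arithmetic behind an error budget is simple enough to sketch. The example below assumes a budget expressed as the number of "bad" events an SLO tolerates in a period; the 98% target, the event counts and the names (Focus, allowed_bad_events, budget_remaining, decide) are hypothetical, and the freeze rule is a simplified reading of the policy described above rather than the exact policy from [3].

```python
from enum import Enum


class Focus(Enum):
    FEATURES = "continue feature work"
    MAINTENANCE = "freeze features, work on reliability"


def allowed_bad_events(slo_target: float, total_events: int) -> float:
    """Error budget for the period: how many bad events the SLO tolerates."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the budget still unspent (negative when overspent)."""
    allowed = allowed_bad_events(slo_target, total_events)
    if allowed == 0:
        return 1.0  # no events in the period, so nothing was charged
    return 1.0 - bad_events / allowed


def decide(slo_target: float, total_events: int, bad_events: int) -> Focus:
    """Simplified policy: once the budget is used up, innovation stops."""
    if budget_remaining(slo_target, total_events, bad_events) > 0:
        return Focus.FEATURES
    return Focus.MAINTENANCE


# Hypothetical month: 98% target over 2000 schedules tolerates about 40 bad ones.
for bad in (10, 60):
    left = budget_remaining(0.98, 2000, bad)
    print(f"{bad} bad schedules: {left:.0%} of budget left -> {decide(0.98, 2000, bad).value}")
```

With 10 bad schedules most of the budget is intact and feature work continues; with 60 the budget is overspent and the policy shifts the team to maintenance.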
3. Manage change
Embed the aforementioned concepts into existing planning frameworks, including project/program management (waterfall and agile implementations) and the product roadmap, and connect the SLAs, SLOs, SLIs and the error budget to key business and customer metrics. Here is a prime benefit of using an error budget: if your SLOs and the error budget are set well, you will have no trouble convincing the other teams to make a tradeoff. Furthermore, you will have an effective tool at your disposal and will no longer be defenceless when caught in the dilemma of technical debt vs innovation.
If your company has adopted #agile software development, you may find it faster and easier to implement such policies, but they also work with waterfall.
Consider three articles to get you started and reach out to me, Danila Rudenka, if you need help.
Steven Thurgood: Example Error Budget Policy (SRE Workbook)
I picked on aviation because many people were affected by these outages, not because this industry is particularly good or bad. It is just relatable. Other conventional industries have experienced similar or worse problems. Here are some recent ones (courtesy of the Financial Times):
Glitch at NYSE briefly halts trading in dozens of blue-chip stocks
Companies including ExxonMobil and McDonald’s among ‘subset’ affected by problems https://on.ft.com/3j0gMl8
TSB fined £49mn over IT outage
Bank took 8 months to return to ‘business as usual’ after botched tech transfer, say regulators https://on.ft.com/3BNUREa
Solana wallets ‘drained’ in blow to crypto network
Thousands of accounts hit in apparent hack that marks a setback for the blockchain https://on.ft.com/3Jq7g3s
American Express payments hit by outages across global network
Customers unable to use card company’s website, mobile apps and 2-step verification system https://on.ft.com/3iWv4Q5
Singapore regulator warns DBS Bank over prolonged service outage
Problems at south-east Asia’s biggest lender come as city seeks to gain edge as regional financial centre https://on.ft.com/3xk6Tl8
‘Tech debt’: why badly written code can haunt companies for decades
Maintaining and building on top of old software can be costly — and deter innovation https://on.ft.com/3mN1P2f