“You cannot neglect your infrastructure. Just as the world around you changes, so must it change to keep up with your growing business. You ignore this at your peril.”
I’m going to take a break from the Saxy Inc. series to offer my $0.02 USD on the debacle of the previous week, the meltdown of Southwest Airlines, especially since I was one of the unfortunate ones caught up in it. Unlike many others, I was able to escape mostly unscathed. This is a personal story and a cautionary tale about neglecting and delaying modernization efforts.
For those who may not know me personally, I have two passions: technology and aviation. My experience with computing stretches across many different technologies and industries, from the national supercomputing centers to traditional brick-and-mortar companies, and across the spectrum from working in the bowels of the operating system at the assembly level all the way up to websites and distributed computing. I also love aviation, from being a member of the Civil Air Patrol cadet corps in high school to being an instrument-rated private pilot. I’ve been lucky enough to get some 737 time in a full-motion simulator. Yes, I am the guy who always gets a window seat and still complains that the view isn’t as good as up front.
I also enjoy flying with Southwest, the unofficial airline of California. Their customer experience has always been top notch, from the ticketing and gate agents all the way down to the ramp. They’ve worked magic many times over the years, and it was heartbreaking to watch those on the ground (and in the air) be stretched to their absolute limits over the past week.
It has been interesting watching the mainstream press waver between the weather and legacy systems as the cause of the cancellations, with most of the emphasis on the weather and 90s tech. Yet it’s been even more fascinating reading the “from the trenches” posts on Reddit. While much has been mentioned about the 90s technology running the company and the unique way it operates, the bottom line is that Southwest never gave the systems powering its novel approach to routing the care and feeding needed to keep them operating into the future.
You could call it a legacy system, and you wouldn’t be entirely wrong. After all, a legacy system is an outdated system that is still in use: it still meets the needs it was designed for, but it doesn’t allow for growth. I posted a question on Twitter a while back asking what comes to mind with regard to legacy code or systems. The rough consensus leaned towards abandonment. That part doesn’t seem right, though, as there are plenty of core libraries, such as libc, that are rarely touched yet have worked admirably for decades. It must be something else.
Teasing the definition apart, two questions come up:
- What makes a system outdated?
- Why doesn’t it allow for growth?
My interpretation of outdated hardware is closer to a classic car. A 1967 Ford Mustang is very much outdated compared to today’s cars, and yet there are those who still own, drive, and maintain them. I’m also willing to bet that if you know how to drive a manual transmission, you could get in one and drive where you need to go.
For computers, this isn’t necessarily true. New features are added all the time: faster clock speeds, larger instruction sets, more memory, and so on. Software developers will leverage all the bells and whistles of the new platform while completely disregarding the older technology. The part that makes a system “legacy” is whether the manufacturer or third-party suppliers are still willing to support the component. One example is the floppy drive, which was supported until Apple began removing it from their computers and the rest of the industry followed suit. The same happened with the headphone jack that used to be prevalent on cell phones.
Surprisingly enough, mainframes get lumped into this category as well, but what most people tend to forget is that the hardware is built to a much higher standard and doesn’t fit the mold of a typical PC. The same is true for HPC systems. The hardware is designed to run faster and longer, and that means it can last years beyond a typical PC, where the average replacement cycle is about 18 months.
No, what makes these kinds of systems hard to maintain from a hardware perspective is parts availability and finding someone familiar with the underlying architecture, precisely because it is not a typical PC.
So that leaves the second question: why doesn’t it allow for growth? As per the Reddit post I linked to earlier, Southwest has grown quite a bit over the last decade, but infrastructure was never a priority under the previous leadership (the Gary Kelly years). Infrastructure is insidious in that it takes years of neglect before the damage has a serious impact. At first the impact is small, but it grows over time until it can no longer be ignored. Repairing the damage takes even longer, and that is where Southwest is right now. Luckily, Bob Jordan seems to be part of the old guard that sees infrastructure as a priority. Unfortunately, he had little time to get things in place before this week and will be spending years cleaning up the mess.
Why bring up the lengthy time required for infrastructure changes? Isn’t making a software change supposed to be no big deal? Why didn’t they foresee this before things went south?
As much as we’d love to think of software as something you write once, it is an abstraction of the world around us. As our understanding of the world changes and we discover new ways to represent it, so must the software change. One of the more interesting things about software is that there can be more than one correct solution to a problem, and sometimes one solution is more correct than another depending on the situation.
Take the scheduling fiasco: I’m sure their solution for tracking flights and personnel would have been fine for a company about a third of its current size, and we would have been none the wiser. In fact, the pandemic cut the number of flights considerably, bringing the numbers closer to where they were several years ago and buying time until this week. For example, in 2019 there were 38.9 million flights globally; the pandemic dropped that to 16.9 million in 2020, well below 2004’s total. Estimates put 2021 at around 20.1 million flights and 2022 at roughly 33.8 million by year’s end, which makes for quite the uptick, and it’s no surprise a few snags were hit along the way.
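To put that rebound in perspective, here is a quick back-of-the-envelope calculation using the figures cited above (the 2021 and 2022 numbers are estimates, so treat the percentages as rough):

```python
# Global flight counts cited above, in millions (2021 and 2022 are estimates)
flights = {2019: 38.9, 2020: 16.9, 2021: 20.1, 2022: 33.8}

years = sorted(flights)
for prev, curr in zip(years, years[1:]):
    change = (flights[curr] - flights[prev]) / flights[prev] * 100
    print(f"{prev} -> {curr}: {change:+.0f}%")

# 2019 -> 2020: -57%
# 2020 -> 2021: +19%
# 2021 -> 2022: +68%
```

That last jump is the uptick in question: roughly two-thirds more flights in a single year, landing on systems that had been coasting along at pandemic-era volumes.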
While we’d all love to blame 90s-style technology, you cannot gut and replace everything every couple of years either (contrary to what Elon Musk says). Putting things in containers and running them in the cloud doesn’t make a system instantly scalable, and neither does attempting to replace or outsource the custom code that has powered their unique routing capabilities for decades. There are, however, places where things can be optimized and updated to allow for greater scalability and ease of use. It just takes time.
I am positive that those with the expertise, sitting on the front lines, saw the warning signs long before last week but had their pleas ignored. The growing number of incidents, the longer recovery from an adverse weather event earlier in the year, the degradation of normal day-to-day performance. You can ignore these signs, just as the prior management did, but you do so at your own peril.
Unfortunately for the current management, there are no quick fixes. An internal portal that crews could use to obtain their daily routes and report diversions, without sitting on hold for hours waiting to update the operations center, would have saved considerable time and reduced the impact on crew duty limits. Allowing concurrent changes to the system would have let more changes be accepted and helped prevent disaster. While software is malleable, it takes time not only to develop a fix but also to deploy and operationalize it. After all, there isn’t a second airline that can be used as an integration testbed.
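To make the concurrent-changes point concrete, here is a minimal sketch of one common approach, optimistic concurrency control: every record carries a version number, and a write only succeeds if nobody else has changed the record since it was read. The data model and names below are hypothetical illustrations, not a description of Southwest’s actual system.

```python
from dataclasses import dataclass


@dataclass
class Assignment:
    crew_id: str
    flight: str
    version: int  # bumped on every successful write


class ConflictError(Exception):
    """Raised when someone else updated the record first."""


# Hypothetical in-memory store standing in for the real scheduling database.
store: dict[str, Assignment] = {
    "C123": Assignment(crew_id="C123", flight="WN2104", version=1),
}


def reassign(crew_id: str, new_flight: str, expected_version: int) -> Assignment:
    """Apply a change only if the record hasn't changed since it was read."""
    current = store[crew_id]
    if current.version != expected_version:
        # Another scheduler (or the crew portal) got there first; re-read and retry.
        raise ConflictError(
            f"{crew_id} is at version {current.version}, expected {expected_version}"
        )
    updated = Assignment(crew_id, new_flight, current.version + 1)
    store[crew_id] = updated
    return updated


# Two schedulers read version 1; the first write wins, the second must retry.
reassign("C123", "WN1885", expected_version=1)
try:
    reassign("C123", "WN990", expected_version=1)
except ConflictError as err:
    print("retry needed:", err)
```

The specific pattern matters less than the principle: accepting many small, concurrent updates from a portal, phone agents, and automation requires the underlying system to be designed for it, and that is exactly the kind of change that takes months to retrofit.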
While what I suggested above is not outside the realm of possibility, these “simple” modifications could easily take many months, if not years. Without knowing how the underlying system that tracks the entire fleet of aircraft and personnel works, it would be impossible to make snap changes. Doubly so for the underlying hardware, if it isn’t your run-of-the-mill PC environment.
Things such as teasing out which parts of the system are impacted, flushing out bugs, performing trial runs, and deploying throughout the rest of the company take time. And just as with the hardware above, finding people who are familiar enough with the underlying software, and who have the confidence to change it without making things worse, is also quite a challenge.
While the prior leadership escaped the consequences of this past week, the current leadership inherited a mess. Even if things hadn’t imploded the way they did, they would still be stuck cleaning up for years to come: just as a train’s momentum means it takes a long while to stop, it takes just as long to come back up to speed.