Insights From Software Architecture: A Tale of Two Systems
A colleague understands a new system using his software architecture skills, and I try to figure out how he did it.
Part 1 (How is a Software Architect Like a Football Coach?) introduced the idea that developers can reason about the behavior of large-scale, complex software systems using high-level abstractions drawn from software architecture principles.
A Tale of Two Systems
A former coworker once recounted a project where her team’s enterprise system was getting dangerously close to exceeding its capacity. The large, expensive Windows servers they ran it on were fast enough to handle the load, but some tasks were exceeding the limits of what could fit in a single Win32 process. So they undertook a major overhaul of the system’s architecture to make it more scalable.
The new system had a different architecture, but functionally it was intended to be equivalent. Their strategy was to deploy the two systems in parallel so that any discrepancy in their outputs could be tracked down and fixed.
They expected that the new system’s state-of-the-art highly-scalable architecture would significantly outperform the old system’s more primitive and over-extended architecture. To their surprise, when they measured the performance in production, the new architecture was giving them no performance boost at all. In fact, the new system was significantly slower.
What happens next? Panic? Finger pointing?
My first thought as a software engineer was that somebody must have done something wrong. Another coworker, who has a background in software architecture, heard the same story, but his reaction was different. He said that of course the new system was slower. What?!
I’m starting to notice that it is quite useful for a software engineer to know a bit about software architecture.
First, the long answer for why he was right.
Here’s a sample piece of the system. The program’s task was to read blocks of data from input files, add together all blocks that are the same type (i.e. same color in the diagram), and write the combined blocks to the output file.
The output file was opened as a memory-mapped file, which means the program interacted with the file as if it were just a big block of virtual memory. What’s cool about a memory-mapped file is that it takes the same hardware-accelerated machinery that makes virtual memory fast and uses it to speed up reading and writing a file.
And this was the problem. In 32-bit Windows, a process gets at most 2 GB (or 3 GB with a boot-time switch) of user address space, and a memory-mapped view has to fit into a single contiguous stretch of whatever part of it is still unused. The bigger the output file gets, the harder that contiguous space is to find.
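To make the old design concrete, here is a minimal sketch in Python rather than the original Win32 code. The fixed block size, the first-byte type tag, and the toy byte-wise "addition" are all assumptions for illustration; the article does not describe the real formats. The point to notice is that the whole output mapping must fit in one contiguous run of the process’s address space.

```python
import mmap
from collections import defaultdict

RECORD_SIZE = 4096  # assumed fixed block size; the article gives no real sizes


def combine_blocks(input_paths, output_path):
    """Sum all input blocks of the same type and write each combined block
    into a memory-mapped output file (the old, single-process design)."""
    combined = defaultdict(lambda: bytearray(RECORD_SIZE))

    for path in input_paths:
        with open(path, "rb") as f:
            while True:
                block = f.read(RECORD_SIZE)
                if len(block) < RECORD_SIZE:
                    break
                block_type = block[0]  # assume the first byte tags the block type
                acc = combined[block_type]
                acc[0] = block_type
                for i in range(1, RECORD_SIZE):
                    acc[i] = (acc[i] + block[i]) % 256  # toy stand-in for "add together"

    # The entire output file is mapped into the process's address space.
    # On 32-bit Windows that mapping must fit into one contiguous run of
    # free virtual address space, which is the limit described above.
    output_size = len(combined) * RECORD_SIZE
    with open(output_path, "w+b") as f:
        f.truncate(output_size)
        with mmap.mmap(f.fileno(), output_size) as out:
            for slot, (block_type, acc) in enumerate(sorted(combined.items())):
                out[slot * RECORD_SIZE:(slot + 1) * RECORD_SIZE] = bytes(acc)
```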
In the new architecture, the program hadn’t changed much in the way it worked, except now the input and output “files” weren’t really files anymore. Instead of writing to a memory-mapped file, it now wrote through a remote procedure call API to a clustered in-memory data grid. The data was stored on another set of computers.
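Here is the same write path under the new architecture, again as a hedged sketch: GridClient, its put() method, and the host names are illustrative stand-ins, since the article never names the actual data grid product or its API.

```python
class GridClient:
    """Illustrative stand-in for a clustered in-memory data grid client;
    the article does not name the real product or its API."""

    def __init__(self, hosts):
        self.hosts = hosts          # e.g. ["grid-node-1:5701", "grid-node-2:5701"]
        self._stub_store = {}       # local stand-in; a real client holds no data locally

    def put(self, key, value):
        # A real put() serializes the value and sends it over the network to
        # whichever cluster node owns the key's partition. That per-write
        # round trip is a cost the memory-mapped design never paid.
        self._stub_store[key] = value


def write_combined_blocks(grid, combined):
    # Same combining output as before, but every "write to the output file"
    # is now a remote call instead of a store into mapped memory.
    for block_type, acc in combined.items():
        grid.put(("combined", block_type), bytes(acc))
```

The combining logic is essentially unchanged; what changed is that every write now pays for serialization and a network round trip.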
Well, if you put it that way… I see why it makes sense that it’s slower. So what were the system architects thinking?
Remember that the original problem was scalability. Until now, if the program ran too slowly, they solved the problem by buying a bigger server, a so-called “scale up” tactic. That worked until they began to hit other barriers. To get beyond those, they needed to change the architecture to something that provided what the old architecture was lacking: the ability to scale beyond what can fit inside one machine. But that ability to scale comes at a price in performance.
So let’s say that the old software could finish the job with a million records in 30 minutes, but the new architecture running on the same machine takes an hour. Is that a problem? What about if the job grows to 2 million records and takes 2 hours on the new architecture, whereas with the old software you could finish the job in… NEVER. You couldn’t do a job that size on the old software. It would simply abort with a memory allocation error. Now that’s a problem.
But what if 2 hours is a problem? The new architecture allows for that problem to be solved in a way that the old architecture couldn’t directly support, and that is to add more computers, a so-called “scale out” tactic.
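For completeness, a tiny illustration of the “scale out” idea: if block types are hashed across worker nodes, each node combines only its own share, and capacity grows by adding nodes. The node names, counts, and hash choice here are purely illustrative, not from the article.

```python
import zlib

# Illustrative cluster: add nodes to add capacity.
NODES = ["worker-1", "worker-2", "worker-3", "worker-4"]


def owner_node(block_type: bytes) -> str:
    """Hash a block type to the node responsible for combining it."""
    return NODES[zlib.crc32(block_type) % len(NODES)]


print(owner_node(b"blue"))   # every "blue" block lands on the same worker,
print(owner_node(b"green"))  # so each worker combines its own types independently
```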
Architecture Insight
I could see the logic once it was explained, but how did the developer with the software architecture training know the answer right away, when I had to work through all this detail before I could see it too?
He was looking past the details to see underlying patterns that were familiar to him, and once he could see the patterns, he could discard the irrelevant detail and just reason about the patterns in the abstract.
This kind of insight is valuable for all developers. If the rationale for all the architecture design decisions is to increase the system’s scalability, then you can make sure that your smaller design and implementation decisions remain consistent with that goal.
See part 3 (Developer Happiness).
Thanks for reading. Please like and share. You can find my previous LinkedIn articles here.