Solving the Mythical Man-Month: Scale Out Programming

The Mythical Man-Month: Essays on Software Engineering has itself taken on a mythical status among engineers and engineering leaders. One of the core concepts introduced in the book by Fred Brooks is that there is a limit to parallelism in software projects. Quoting from the book: "Men and months are interchangeable commodities only when a task can be partitioned among many workers with no communication among them." Brooks further observes that some tasks, much like the gestation of a baby in a womb, can't be accelerated through parallelism. He therefore argues that adding more engineers to a project, particularly late in the project, only serves to slow it down.

From what I have seen over the last 20 years, I think there has been a significant, yet under-reported, shift in how large software systems are engineered. We're now seeing examples of very large, complex software systems being engineered in a way where tasks are partitioned among many workers and communication overhead is minimized. I am coining the phrase "scale-out programming" to describe this phenomenon, and I will walk through a few examples where I think we're seeing an entirely new style of software development.

First, a refresher on scale-out vs. scale-up. These terms are most frequently used when deciding how to distribute computational work onto physical computers. The idea behind scale-up is that you keep making a single machine more and more powerful. You can then write a single program that relies on incredibly high-speed communication and access to shared memory to perform its computational work. Classically, handling bigger jobs against more data meant a database needed a more and more powerful host... unless you found a way to distribute the work among many nodes in a way that minimized the communication overhead. This latter technique is termed scale-out: instead of using an ever more powerful machine, you use many machines and scale out the work.
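To make the distinction concrete, here is a minimal sketch in Python (with hypothetical data; an illustration, not anyone's production code) of the same word-counting job done scale-up style on one worker versus scale-out style across several workers. The key property of the scale-out version is that the partitions share nothing until one cheap merge at the end.

```python
from multiprocessing import Pool

def count_words(chunk):
    # Each worker processes its partition independently:
    # no communication with other workers until the merge.
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge(partials):
    # The only coordination point: a single, cheap reduce step.
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    # Hypothetical corpus, partitioned up front so workers share nothing.
    chunks = ["the quick brown fox", "jumps over the lazy dog", "the end"]

    # Scale-up: one machine does everything sequentially.
    scale_up = merge(count_words(c) for c in chunks)

    # Scale-out: many workers, each owning one partition.
    with Pool(processes=3) as pool:
        scale_out = merge(pool.map(count_words, chunks))

    assert scale_up == scale_out  # same answer; only the topology differs
```

The same shape scales from three processes on one box to thousands of machines; it is the up-front partitioning, not the hardware, that eliminates the communication overhead.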

There's an equivalent among software programmers. Through a combination of experience, judgment, typing speed, code-editing prowess, and a preternatural ability to see and reorganize code in their heads, some programmers are the equivalent of scale-up machines. They are "10x" more powerful than the average programmer. I have not seen empirical evidence in support of the concept of a "10x programmer"; as best I can tell, the phrase was adapted from the investigation Jim Collins did into companies that did empirically outperform their peers by 10x over a given period for his book Great by Choice: Uncertainty, Chaos, and Luck—Why Some Thrive Despite Them All, and the concept was then applied to programmers. But I will also say that anyone in the tech industry, while almost universally demurring when it comes to calling themselves a 10xer, will acknowledge there are people they have worked with who simply outproduce others by many multiples. I refer to these masters of their craft as "scale-up programmers".

The problem is there's no Moore's Law effect for scale-up programmers. The best of the best may be getting a few percent better over time, but they're not improving in productivity and output at the rate processor speeds have been increasing. Meanwhile, the demand for their talents keeps going up, so the compensation they can command has been rising seemingly exponentially.

The scale-out solution to programming requires finding ways to build software with a very large number of very good programmers, rather than a very small number of the world's best. If you look at the history of most operating systems ever written, almost all have been written by a very small number of the world's best, particularly at the core.

Arguably, Microsoft could be considered the first to crack scale-out engineering as they scaled out software development on the Windows OS, but they also seemed to hit the constraints Fred Brooks predicted. Notably, releases took longer and longer, and the internal communication overhead did increase, as evidenced by the rumored very high ratio of program managers to developers.

Amazon cracked the code on this a number of years ago by moving to a highly distributed service-oriented architecture, with the intention of minimizing the communication and coordination overhead required to evolve its systems very quickly.
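As a sketch of the organizational idea (in Python, with hypothetical service and method names; real services like Amazon's are, of course, networked and far larger), the point is that each team owns a service behind a narrow published interface, so a team can rewrite its internals without coordinating with any consumer:

```python
class PricingService:
    # Owned by one team. The internals (a dict today, a database
    # tomorrow) can change without any cross-team coordination.
    def price_usd(self, sku: str) -> float:
        return {"widget": 9.99, "gadget": 24.50}.get(sku, 0.0)

class CheckoutService:
    # Owned by a different team. It depends only on the narrow
    # published interface, not on how pricing is implemented.
    def __init__(self, pricing: PricingService):
        self._pricing = pricing

    def order_total(self, skus: list[str]) -> float:
        return sum(self._pricing.price_usd(s) for s in skus)

if __name__ == "__main__":
    checkout = CheckoutService(PricingService())
    print(checkout.order_total(["widget", "gadget"]))  # 34.49
```

The narrower the interface, the less communication two teams need, which is exactly the condition Brooks identified for men and months to be interchangeable.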

Most recently, a Facebook engineer apparently gave a very interesting talk at iOS Dev UK 2015 discussing how Facebook has achieved scale-out programming and why that has led to a 100MB+ iPhone app with over 18,000 classes (as documented in this Tumblr post). Unfortunately, the slide deck for that presentation may have been too revealing and has since been pulled. But if you look at the scale and growth of Facebook's engineering organization, combined with the fact that they haven't seemed to slow down in their ability to innovate, you have to hypothesize they have solved for scale-out programming.

With scale-out style programming, because there is less project-level communication and a much greater level of independence between the teams contributing code, you also see redundancy and 'bloat'. The old-school master programmers tend to hate bloat and redundancy because, historically, compute resources were precious. Moving bits around was expensive. Large binaries didn't compile in reasonable times. Waste and redundancy were the enemy. Now, as computers have gotten faster, the compute infrastructure is no longer the bottleneck; programmers are.

One of the most interesting examples of the difference between scale-up programming and scale-out programming came in the analysis of two pieces of malware that were apparently used to counter Iran's nuclear ambitions. The first was Stuxnet. In David Kushner's "The Real Story of Stuxnet", he paints a picture of what I would call classic "scale-up engineering". The engineers of the Stuxnet virus appear to have been masters of their craft. The sophistication, simplicity, and precision were unlike anything the anti-virus community had seen before. Stuxnet was a small piece of master craftwork, smaller than most iPhone photos or Word documents these days.

Now, compare this to the Flame malware. Flame appears to have been used to gather information from Windows-based machines. But compared to Stuxnet, Flame was 40x the size, at over 20MB in its base installation. In the article "Meet 'Flame,' The Massive Spy Malware Infiltrating Iranian Computers," Wired reports that Flame "contains multiple libraries, SQLite3 databases, various levels of encryption — some strong, some weak — and 20 plug-ins that can be swapped in and out to provide various functionality for the attackers. It even contains some code that is written in the LUA programming language — an uncommon choice for malware."

Flame appears to have followed a more scale-out style of programming, with redundant functionality, plug-in components, rapid-development scripting languages, and inconsistent programming models. These are all signs that suggest to me it was built by a bigger team of programmers who may not have been at the same master level as the Stuxnet programmers.

It looks like even the entities that conduct this type of highly targeted work have had to make the shift from scale-up to scale-out programming.

Scale-out programming is here to stay because there simply aren't enough master-craft programmers with years of experience relative to the ambitions of today's tech industry. The challenge now is how to add the layers of automation and tooling necessary to help the masses of programmers stay lean while moving fast in parallel.

And if you're going to target my computer with malware... please show some self-respect and leave SQLite3 out. 

(Note about the image: from what I have been able to track down, it shows a WWII soldiers' identification processing center. An interesting example of scale-out engineering to support the war effort.)

_______________________________

About the Author: Brad Porter is a veteran of the Internet boom and has spent the past 20 years helping some of the most innovative and fastest growing companies scale their organizations and technology platforms.  If you like this post, please share.

Arnav Aviraj Mishra

Senior Engineer at Qualys | Master's IIT, Bombay | Former Research Fellow DRDO | eBPF and security enthusiast

11 months

Thanks, Brad Porter for the lovely article. I guess now with GenAI's rapid growth the scale-out programmers have a greater chance of being replaced by the machines they had been building.

Harald Ujc

Dynamic Leader & AI Innovator | President & CEO at Invenci, Inc. | Senior AI/LLM Web Engineer at CrossLeaf Web Engineering

8 years

Enjoyed this. I've been out of the 'craft' for a few years now and found your post a great 'catch-up' on the state of affairs.

Mikhail Garber

Principal Software Developer | Lead | Ex-Amazon | 30+ YOE

9 years

I think there is serious misunderstanding there. Problems identified in MMM are still very much there. In fact, they got worse due to declining quality control in software development. But, in properly-distributed environment, these problems are pushed way down, deep inside individual (micro)services, so while individual pieces still do not scale, organization as a whole can move forward quickly. But, as Fred Brooks also said: there is no silver bullet. Distributed model often introduces horrific integration and testing problems that may cost you even more.

Telmo Félix

Chief Baking Officer

9 years

"Amazon cracked the code on this a number of years ago by moving to a very distributed service-oriented architecture with the intention of minimizing the communication and coordination overhead required when evolving the systems very quickly." This is the central paragraph. It's a known thing, but still being taken very slow by industries that like to work on a command and control paradigm.

Matt Fitzgerald

Director (Core Learning Technologies)

9 years

Probably why I could never be a dev in the industry again!!
