Is AI Energy Demand Spinning Out of Control?
The human brain apparently only needs around 20 Watts of power to run. That is a staggeringly small amount when you consider that, when asked a question, you are able to quickly generate an answer based upon your lifetime of learning and, at some level, understanding: a mix of dry fact, experience and something else unquantifiable we call the creative spark that is unique to you. But this oft-romanticised view of intelligence is being unpacked at a fairly alarming rate by the current pace of AI model development, which is reaching far beyond any single human's capacity for knowledge. The [industrial] point of AI is to do this very thing, so that more may be done in the same unit of time. However, an AI inference model with only single-human-level knowledge needs something like 500,000 to 1 million Watts to run, compared to our meagre 20 Watt light bulb of a brain.
This raises two major points for consideration:
(i) Does our intelligence work differently from AI? and
(ii) What are the Energy Supply impacts of scaling AI?
THINKING ABOUT THINKING
At various times in our lives I am sure we have all paused to give thought to how arbitrary words are. How do they create meaning for us? Repeat the same word over and over and over again and quickly its meaning seems to collapse. But what is going on inside your head that can relate words together to form concepts, and let you start to make predictions about events in your next waking minute, hour, day or even longer?
Given the variety of human languages that have evolved, it is clear that words are just labels with no intrinsic meaning. Their relationship to one another is what really allows us to understand the idea of something existing directly outside our immediate physical surroundings.
For example, how would you as a human reader relate the following collection of words and what picture do you paint inside your head after digesting them?
‘Dish, bubble, sink, brush, boredom.’
I suspect none of you struggled to infer a context from this apparently random set of input variables. In fact, you will probably pout at the thought of them even being random given the speed of inference your brain operates at.
[Read on to see how you did against the computer]
–
Asking ChatGPT 4o resulted in:
“The words "Dish," "Bubble," "Sink," "Brush," and "Boredom" all relate to the scenario of washing dishes. Here's how they connect:
The scenario you are thinking of is the mundane task of washing dishes in the kitchen.”
BOOM, what just happened? Well, the Godfather of AI, Professor Geoffrey Hinton, believes very strongly that the learning and relational ‘understanding’ required to process these random input variables is analogous to how our human intelligence works when trying to make predictions about the world. This is a very strong statement, and he vehemently rejects the idea that what is going on is just some clever auto-correct routine.
BACK [PROPAGATION] TO THE FUTURE
Way back in 1985 when I was trying to recreate the Back to the Future skateboard scene by hanging off a passing BMX, Hinton was busy creating Tiny Language Models (TLMs) where he would feed triples of data to this early AI machine model.
He was experimenting with Artificial Neural Networks (ANNs), an approach that directly eschewed the Symbolic AI view of language as requiring some foundational logic to relate words to one another. Instead, he greedily fed triples of data to his machine with no pre-programming of how the words were related.
He provided statements like:
Charlotte has brother Colin
Colin has mother Victoria
With no other data, the correct inference here being that if Charlotte has a brother, Colin, who has mother Victoria, then:
Charlotte [also] has mother Victoria.
This is precisely the inferred output that the TLM produced. What is striking is that achieving the inference required no detailed seeding of relational logic. Instead, he threw repeated rounds of input data at the machine and used feedback mechanisms to highlight errors until its accuracy improved to the appropriate level. We all recognise this process of learning: it is how we as babies first learn to label our surroundings, eventually moving on to gabble semi-coherently until one day we are popping phraseology loaded with adjectives, adverbs and subjunctive clauses. The feedback process is something Hinton called backwards propagation of error, or simply backpropagation, and is a foundational concept in AI modelling.
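The feedback loop described above can be sketched in a few lines. This is a toy illustration only, not Hinton's family-tree model: it assumes a network with just two weights (the names `w1`, `w2`, `train`, the starting values, the target and the learning rate are all hypothetical), makes a guess, measures the error, and pushes that error backwards through both layers to nudge each weight.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train(steps: int = 2000, learning_rate: float = 0.5) -> list[float]:
    """Train a two-weight toy network and return the loss at every step."""
    w1, w2 = 0.5, -0.5       # hypothetical fixed starting weights
    x, target = 1.0, 0.9     # one made-up training example
    losses = []
    for _ in range(steps):
        # Forward pass: input -> hidden unit -> output guess.
        hidden = sigmoid(w1 * x)
        output = sigmoid(w2 * hidden)
        error = output - target
        losses.append(0.5 * error ** 2)

        # Backward pass: apply the chain rule from the output back to w1.
        delta_out = error * output * (1.0 - output)
        grad_w2 = delta_out * hidden
        delta_hidden = delta_out * w2 * hidden * (1.0 - hidden)
        grad_w1 = delta_hidden * x

        # Nudge each weight against its gradient to shrink the error.
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
    return losses


losses = train()
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

Nothing in the loop says what the right weights are; repeated rounds of guess, error and correction are enough for the loss to shrink.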
The success of ANNs in attacking the ‘language problem’ has also revealed something revelatory about human intelligence. Hinton believes strongly that we basically operate in the same way when learning: taking symbology as input (labels such as words), tying that to an explosion of ‘features’ in our squishy neural networks and generating an output. This generative output is a polite way of saying ‘making stuff up on the fly’ and can be seen in humans as false memories. As Hinton explains: “Memories are not stored, they are generated”.
This imperfect method seems to allow a lot of creativity on the recall. I am drawn to think of all the false memories I have generated over the years, leading to frantic searches for misplaced car keys, as well as my rather more serious participation in jury service, where eyewitness accounts were later shown to be utter works of fiction, a phenomenon that is well documented.
Imperfect our recall may be, but it would seem mother nature has evolved a supremely efficient mechanism that allows our brains to ably adapt to novel situations and still infer meaning and predictions when presented with new data. Importantly, this generative model means we don’t need hardware that can store Exabytes of exact data that could easily become corrupt upon a bump on the head… like if you fell off your skateboard pretending to be Marty McFly. Machines on the other hand can do both: perfect storage and recall as well as generative, aka creative, output. But what they cannot do is perform generative tasks with anywhere near the power efficiency of the human brain [yet].
GO BIG OR GO HOME
Before we talk Skynet, let’s talk about what is happening now… Today we talk in terms of Large Language Models or LLMs for short. These are massively scaled-out descendants of TLMs that benefit from innovations in AI modelling and optimisation techniques. They are also only possible because of the newest infrastructure technologies and architectures, which can provide hugely distributed clusters of processing power and RAM, massive storage that can operate with incredible performance and scale and, of course, my favourite: a high throughput, lossless, low latency network for it all to run over.
A TLM may only require a few hundred Megabytes to Gigabytes of training data to learn to give the desired output, but its scope of operation is very, very limited. An LLM by contrast is intended to be able to operate across Terabytes of data. All of this data has to be ingested and processed through a deep and extensive network of modelled neurons interconnected via sets of parameters called ‘weights’ and tuning knobs called ‘biases’. An LLM such as GPT-3 has as many as 175 billion parameters and GPT-4 is estimated to operate with trillions. The sophistication of these models is breathtaking, but getting them to reach a critical mass of operation and spit out answers in near real time has dire consequences for the amount of power, and therefore energy, needed to run the super-high-performance infrastructure they sit on top of.
To be useful, the LLM is first trained over weeks, with millions of backpropagation feedback loops honing the accuracy of the output. The resource requirements for this training period are eye-watering if you wish to create output within your lifetime. To truncate the learning and optimisation periods requires massive parallelism in processing. For example, GPT-3 took around 60 days to learn with its 175 billion parameters. If that had been done on a single GPU, estimates place the computation time at 288 years.
Ref: [https://arxiv.org/pdf/2104.04473].
The solution: scale out to 1000+ GPUs and build a supporting network and storage array to supercharge the whole process. Here is an illustrative example to get within the orders of magnitude we are dealing with; note that it ignores a number of things, including Heating, Ventilation and Air Conditioning (HVAC) requirements.
e.g.
If using popular systems of the GPT-3 period with 8 x GPUs (you can guess yourself what the system is), let’s be generous and assume the system is otherwise at near idle, which gives an estimated power budget of 720 W for the 1 GPU being used.
720 W x 24 hrs x 365 days x 288 years = 1816.47 MegaWatt-hours [of Energy usage]
TOTAL: ~ 2850.59 MWh
JOB COMPLETION TIME: 288 YEARS
If we compare that to scale-out using 1000 GPUs we can truncate the training to ~60 days. Again, using contemporary systems with 8 x GPUs per node, that would be 125 nodes total.
Creating an AI setup with 125 nodes is non-trivial and we need to account for a scale-out high-performance infrastructure too that will be able to accommodate non-blocking transfer of the blur of data that will need to pass between all the GPUs to converge on the output.
So, 1000 GPUs, using 8 GPU nodes, each needing 3200 W = 3200 W x 125 nodes x 24 hrs x 60 days = 576 MWh [of Energy usage]
TOTAL: ~ 589.10 MWh
JOB COMPLETION TIME: 60 DAYS
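The compute-only figures in the two worked examples above (the lines before each TOTAL) can be reproduced with a short script. This is a minimal sketch that assumes a constant power draw for the whole job; the helper name `energy_mwh` is hypothetical, and the TOTAL figures, which presumably include supporting infrastructure, are not modelled.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def energy_mwh(power_watts: float, hours: float) -> float:
    """Energy in megawatt-hours for a constant power draw over a duration."""
    return power_watts * hours / 1_000_000


# Single GPU: 720 W running for 288 years.
single_gpu = energy_mwh(720, HOURS_PER_DAY * DAYS_PER_YEAR * 288)

# Scale-out: 125 nodes at 3200 W each, running for 60 days.
cluster = energy_mwh(3200 * 125, HOURS_PER_DAY * 60)

print(f"Single GPU, 288 years: {single_gpu:.2f} MWh")
print(f"125 nodes, 60 days:    {cluster:.2f} MWh")
print(f"Energy ratio:          {single_gpu / cluster:.2f}x")
```

On these assumptions the single-GPU job uses roughly three times the energy of the 125-node job, on top of taking 288 years rather than 60 days.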
THE CAKE OF DIMINISHING RETURNS
From this little exercise, we can see that scale-out approaches intending to reduce the Job Completion Time (JCT) can also massively reduce the overall energy required. Great, so we can just keep adding GPUs and make things more time and energy efficient, right?
Wrong. Remember that aphorism that ‘you can’t always have your cake and eat it’? Each setup has a sweet spot and a limit, depending on the amount of data you need to process, and the AI model and its parameter space. There will always be a limiting factor in the system, with one of the major ones being the percentage of computational steps that can be run in parallel. In fact, there are always some steps that need to operate with dependencies, meaning you have to drop down to good old fashioned sequential processing (particular order of processing), or you hit some machine limit which drives serial processing (one piece of data at a time) somewhere in the mix. More generally, the accelerative gains from adding more GPUs slow in a curve of diminishing returns, a relationship described by Amdahl’s Law.
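Amdahl’s Law puts a number on that curve: if a fraction p of the work can run in parallel across N processors, the best possible speed-up is 1 / ((1 - p) + p / N). The sketch below is illustrative only; the 99% parallel fraction is an assumed figure, not a measurement of any real training job, and the function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Upper bound on speed-up from Amdahl's Law: 1 / ((1 - p) + p / N)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# Assumed for illustration: 99% of the steps can run in parallel.
for n in (1, 8, 64, 1000, 100_000):
    print(f"{n:>7} GPUs -> {amdahl_speedup(0.99, n):6.1f}x speed-up")
```

Even with only 1% of the work stuck in sequential steps, the speed-up can never exceed 100x however many GPUs are added, which is why each setup has a sweet spot.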
POWER-SUCKING ALIENS
According to the International Energy Agency (IEA), AI is not the only power-sucking alien in our part of the galaxy; so too is crypto-mining, where proof-of-work blockchain calculations are the necessary evil (please, I won’t get into ‘crypto: should we shouldn’t we?’ - I shall leave that one to Netflix to help arbitrate). Basically, the cumulative effects of global cloud DC scaling, AI and crypto don’t paint a rosy picture, with potential for a near doubling of their energy demand by 2026 to over 1000 TWh.
IN THE MIDDLE OF DIFFICULTY LIES OPTIMISM
LLM-type Artificial Intelligence is a truly ground-breaking technology, inspired by trying to reverse engineer the language-related aspects of our own intelligence and what is meant by ‘understanding’. Critically, however, the energy demands of scaling AI pursuits are one of the more pernicious aspects of this supremely important technology. AI is driving huge increases in energy demand that will continue for some time to come as its insertion into all parts of our lives becomes the norm. AI-related endeavours will need to be focused on the really hard problems of our time; this will be essential to ensure AI forms an important part of the solution rather than part of the problem. On this, I remain optimistic.