Understanding Nvidia: inside the AI foundry
AKA: what does Nvidia do and why does it matter?
I was lucky enough to spend a day with Nvidia last week, with the team from Mars (thanks for the invite, Gulen Bengi and Rankin Carroll). It was a mind-expanding and exciting day.
As a nerd (in case you hadn’t guessed), I've followed Nvidia for a while. They made the graphics cards (GeForce) that I dreamt of having for PC gaming as a kid.
The GPU chips they initially created for graphics turned out to be unexpectedly good at both mining cryptocurrency and powering AI, so the company has been growing rapidly for the past couple of years as the world of AI accelerated.
And now they are one of the most valuable companies in the world, with only 35,000 employees. Yet most people haven’t really heard of them, and even fewer really understand what they do.
A lot has been written about the share price increases, the incredible revenue growth and profitability of Nvidia. Pieces like this: https://ft.pressreader.com/article/281887303242986 from the FT are a great read on that topic.
But, it can be difficult to understand exactly what this acceleration of AI could actually do...
So if you've heard a lot about the rocketing share price, but have no idea what they actually do or why folks seem so excited about them, this piece will attempt to explain what it is that (I think) they do, how to frame it, and a bit of the vision for an AI powered future that is driving their extraordinary valuation and excitement.
(If you are technically minded, or work at Nvidia, maybe avert your eyes, as I will describe it how I understand it, in probably inaccurate and grossly simplified terms).
What is Nvidia? Meet the AI Foundry.
Nvidia have described themselves as the AI Foundry. A foundry is defined as a place for casting metal. But I didn't find this an entirely enlightening explanation until I spent some time with them.
The analogy I find myself reaching for is potentially unflattering but helpful. Nvidia are the builders’ merchant, the maker of parts and raw materials.
And they have an impressive selection of extremely good parts for AI construction projects.
The first ‘part’ they make is hardware - powerful GPUs (Graphics Processing Units) that are basically very good at doing a certain type of calculation needed by AI (and graphics engines). The chips they make are currently close to unique, and the best at what they do. (To torture the analogy, this is the Italian marble you might want to build your house out of).
This is the H100 processor, assembled into a supercomputer, the DGX, which is the hardware behind training a lot of the foundation models like GPT-4, Stable Diffusion and more recently Sora.
One of these bad-boys costs about half a million dollars, and Nvidia make the hardware, and the infrastructure to make it work. (This gets technical, but alongside the hardware you need the networking software that allows you to connect thousands of these cores together and operate them effectively without overheating. It’s the core material you need to run powerful AI compute.)
Companies like OpenAI or Meta have thousands of these, networked together in special facilities, running off renewable energy (where possible) and with cooling systems to keep everything working while they do billions of tasks as part of the AI model building process.
But many other companies can access this remotely - you can rent rack space to use one of these, serviced by Nvidia, to do a wide range of compute-heavy tasks. So AI companies big and small turn to Nvidia for this kind of service.
From supercomputers to the Edge, with Jetson
Beyond this, they have a range of other hardware options, which allow for different situations of AI deployment. For example, they have a chip called the Jetson, which is small, affordable, and optimised to run high-compute AI services at low cost, low power and low temperature.
This enables the next buzzword you might start seeing around: ‘Edge AI’. A lot of what we do with AI (and computers in general), we access remotely through a cloud service. Your laptop doesn’t run an AI model; your browser asks an AI model in the cloud to do a thing and send it back. That’s what the ‘wait’ is when you prompt in Midjourney. And why, when we lose the internet, we are cut off from our AI tools.
However, with the right setup you can run (some of) these models locally on a home computer, or in an edge device. Examples of this include AI-powered self-navigating drones, self-driving cars, AI-enabled retail environments and a lot of utility and services people are building around the world. If you want an AI widget that can spot problems in a moving vehicle, critical infrastructure, the transport network, or a hospital, you’re going to want locally running hardware.
These chips run the models, and can potentially refine and fine-tune them - they don’t train the foundation models.
If you lose WiFi while you’re making a fun image in Midjourney - no drama. But you don’t want your self-driving car to lose connection to its supercomputer as you approach a bend. And if we’re using medical AI in a hospital, you don’t want it vulnerable to power cuts, WiFi cuts, cyber attacks or slow responses. So for secure, low latency use cases, you’ve gotta go to the Edge.
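To make that cloud-versus-edge distinction a bit more concrete, here’s a minimal sketch (in Python) of the two paths an application might take. Everything in it is illustrative rather than real: the endpoint URL is made up, and local_model stands in for whatever on-device runtime a Jetson-style board would actually use.

```python
import requests  # used for the cloud path; the endpoint below is purely illustrative

CLOUD_ENDPOINT = "https://example.com/v1/infer"  # hypothetical hosted-model API

def classify_in_cloud(image_bytes: bytes) -> str:
    """Cloud path: send the data over the network and wait for the answer.
    This round trip is where the 'wait' (latency) and the dependence on connectivity come from."""
    response = requests.post(CLOUD_ENDPOINT, files={"image": image_bytes}, timeout=5)
    response.raise_for_status()
    return response.json()["label"]  # assumes the service replies with a JSON label

def classify_at_the_edge(image_bytes: bytes, local_model) -> str:
    """Edge path: the model weights already live on the device (e.g. a Jetson),
    so inference happens locally with no network round trip."""
    return local_model.predict(image_bytes)  # 'local_model' is a stand-in for any on-device runtime

def classify(image_bytes: bytes, local_model=None) -> str:
    """Prefer the edge when a local model is available; otherwise fall back to the cloud."""
    if local_model is not None:
        return classify_at_the_edge(image_bytes, local_model)
    return classify_in_cloud(image_bytes)
```

The point is simply that the edge path never leaves the device, which is what buys you the low latency and resilience described above.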
There's a plethora of different hardware options between the DGX and the Jetson, designed to power a whole infrastructure of AI use cases (which I'll come onto later).
The other ‘parts’ they make are models, services and micro-services.
This is more than just the hard compute power. Nvidia's business is increasingly growing into AI models, services and micro-services.
A model is a custom AI model that you can build your system around. In the same house-building analogy, the model could be the central heating system or boiler. It comes as a pre-built solution to a problem, that you plumb into the rest of the building.
Services are like plumbing pipes and electrical wires. They let you connect together different parts of the whole. And a micro-service would be a special widget or connector to do a little important task.
So it’s useful to know what is possible (and what’s not) with the different services and models - and this extra software layer is another major part of Nvidia’s business.
The important thing this analogy is leading to is what Nvidia isn’t. They’re not the architects of your final house. They’re not the builders or craftspeople who put it together. For that, Nvidia relies on a whole selection of partners and people to use what they’ve got.
Major AI companies like OpenAI are building huge, ambitious, complex projects. But all companies are starting to figure out what they might need. This will need an architect’s vision, and a clear understanding of what we’re trying to achieve and how it’s going to be used, for each different business and company.
Putting these solutions together, and into practice, takes a set of different expertise. And it’s still early days. Earlier this month, Nvidia CEO and founder Jensen Huang said he believed that in the next 4-5 years we’ll need another trillion dollars of investment in data centre infrastructure and hardware to make it all possible.
“You lost me somewhere there, but I’m going with it… but what is it all for?”
It also gets complex figuring out what you might want to build with all this. Like Lego or electricity, it’s something that can be used to do almost anything, so it can be hard to wrap your head around.
Within the models and services, there’s a lot of different ‘superpowers’ that can be plugged together.
- Computer Vision: Computers seeing/sensing and understanding the world - from recognising people, cars, objects, animals, to being able to detect gestures or facial expressions or track the current state of the shelves in a supermarket
- Digital Twins: Creating linked virtual and physical twins, to allow for experimentation and prediction
- Predictive Engines: ability to predict ‘what next’ based on previous data and modelling
- Language comprehension: ability to engage with computers in natural language, with them understanding and responding - the core of many LLMs
- Language translation: fluency and ability to work across many different languages
- Language expression: ability to create audio speech or write speech in a specific style, in any language
- Document / Data / Context Learning: ability to synthesize, digest and hunt through documents to find themes and specific information (so-called ‘needle in a haystack’ problems - and a challenge that LLMs are starting to get better at)
- Image Generation / Modification: creating visual outputs, based on a text or image prompt. This can also include in-painting (changing what’s in an image) and out-painting (extending an image out), alongside style transfer (reworking an image in a different style). But importantly also being able to composite an image together with the right visual semantics, reflections, shadows and lighting
- Video Generation / Modification: similar to the above, but an important additional feature of models like Sora is that they seem to model objects and physics - so they need to accurately generate a world where, if you have a rock dropping into a pond, the ripples and reactions look right
So, you can start to see why the possibilities sprawl out in front of us. If these are all the things available from the AI foundry - what do we want to build?
Digital Twin Prototyping, Prediction and Prevention:
Digital Twins have been around as a term for a while, but it can be difficult to really appreciate the application and potential.
David Gelernter wrote about these 'mirror worlds' in the early 1990s, and in a visionary way imagined how our virtual and digital worlds could be twinned, and how that might transform the way we live.
If you’ve ever played Rollercoaster Tycoon or farm simulator games, these are digital twins where you can practise and experiment. Want to find out what happens if you don’t put brakes on your rollercoaster, without killing real people? Digital twins are like this, but the closer the simulation is to the real world, the more useful and instructive it can be.
Using a range of techniques, from LiDAR scanning to design files to manual re-creation, and blending video and photography, tools like Nvidia Omniverse offer an environment to build accurate digital twins of the real world.
In a digital twin that accurately maps to the real world, enough data and inputs mean that it will follow the same physics and behaviours as our world.
The level of realtime connection and accuracy lets you do some interesting things.
First, you can prototype and experiment. Want to know what would happen to your factory and supply chain if you changed your packaging? Test it out in a digital twin simulation. Turn up the speed on the conveyor belt. Test things you couldn’t or wouldn’t do in a physical factory.
It can also be used to model and predict problems - want to see how the building would evacuate in a fire? How about if there’s a stack of boxes in the way of the fire exit?
You can also use these virtual simulations to train an AI to spot problems - an example of warehouse technology Nvidia shared looked at whether people were lifting boxes safely, and whether they were choosing safe loads. You can model 1,000 people slipping and falling in your factory, so your systems can recognise it (and prevent it).
And it can also run simulations, continually, mapped to the real world. This is part of autonomous driving systems - a digital twin of the world is continually updated, and the model predicts and pre-empts. What would happen if that person stepped into the road? Or that tree fell? Or that lorry tipped over? It can predict and model different outcomes and choose the best possible action. It’s sort of how humans work, but potentially done faster and more reliably.
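As a toy illustration of that ‘predict and pre-empt’ loop, here’s a sketch of rolling a few candidate actions forward inside a simulated twin and picking the least risky one. The TwinSimulator class and its crude dynamics are invented purely for illustration; a real digital twin would be a physics-accurate model fed by live sensor data.

```python
import copy

class TwinSimulator:
    """A stand-in for a digital twin: it holds some state and can roll forward in time."""
    def __init__(self, state):
        self.state = dict(state)

    def step(self, action):
        # Toy dynamics: braking sheds speed, accelerating adds it, and we close on the hazard.
        if action == "brake":
            self.state["speed"] = max(0, self.state["speed"] - 10)
        elif action == "accelerate":
            self.state["speed"] += 5
        self.state["distance_to_hazard"] -= self.state["speed"]

    def risk(self):
        # Crude risk score: overshooting the hazard while still carrying speed is bad.
        return max(0, -self.state["distance_to_hazard"]) * self.state["speed"]

def choose_safest_action(twin, candidate_actions, horizon=3):
    """Clone the twin, roll each candidate action forward, and pick the lowest-risk outcome."""
    scores = {}
    for action in candidate_actions:
        rollout = copy.deepcopy(twin)   # experiment in the copy, never in the real world
        for _ in range(horizon):
            rollout.step(action)
        scores[action] = rollout.risk()
    return min(scores, key=scores.get)

twin = TwinSimulator({"speed": 30, "distance_to_hazard": 50})
print(choose_safest_action(twin, ["brake", "accelerate"]))  # expected output: 'brake'
```

A real system would run thousands of these ‘what if’ rollouts against a far richer model, but the shape of the decision - simulate, score, choose - is the same.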
This piece: https://takes.jamesomalley.co.uk/p/tfls-ai-tube-station-experiment-is is a fascinating explanation of this put into practice by TfL in a Tube station, and shows some of its potential. (Thanks to Zoe Scaman and Stephan Pretorius, who both shared it.)
Ultimately the more accurately we can map the world, recreate it virtually, and then accurately and meaningfully experiment in that digital twin, the more solutions we’ll find. From medical outcomes to weather predictions, to global logistics and finance.
Rather than screwing up in the real world, where we only get one chance, let’s screw up in a digital world repeatedly, until we get it right. That’s partly how AlphaGo became the world’s best at Go. It had the most practice, simulating more games than all of humanity has ever played.
Digital World Production of Content:
These digital worlds don’t need to just map to our physical world - they can also become a fully virtual content production environment, allowing us to create content, assets and stories in infinite variety and function.
Accurate virtual models of products and brands mean we can make content that is accurate to the real world.
Virtual human cloning or synthesising means we can create an infinite variety of photorealistic people, with increasing fidelity and control.
So rather than shooting a movie on a green screen or LED set, and then putting a virtual environment around it, the full scene can be virtual.
And rather than hand-designing and building the world, that can be generated using text and image prompts.
Instead of having to provide feedback to an animator to painstakingly code all the movement, the objects and characters in this environment can be ‘directed’ in natural language. “Do it again but with more swagger”.
So you will soon be able to describe your ideal car in natural language, have it configured for you, and then see what it would look like driving down your street or parked in your garage. (This is what WPP have been building in partnership with Nvidia: https://nvidianews.nvidia.com/news/wpp-partners-with-nvidia-to-build-generative-ai-enabled-content-engine-for-digital-advertising )
Robotics Training and Deployment:
Reinforcement learning methods give robots the ability to learn by themselves through trial and error. It’s similar to how humans learn: the way we learn that the cooker is hot, that water is wet, and what happens when we drop a plate, is through a mix of supervision and trial and error. We drop stuff and see what happens. Our parents tell us not to touch the stove, and we usually do anyway and learn the hard way. Robots can do the same, with AI built into them.
However, 500lb+ industrial robots aren’t given as much leeway to learn by smashing around like a toddler. So virtual worlds let robots practise in simulation and apply what they’ve learned in the real world. Nvidia’s Project Eureka is looking at this, with repeated simulations and virtual-world learning informing how physical robots can operate.
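For a feel of what ‘trial and error in a virtual world’ means in code, here’s a deliberately tiny, self-contained sketch: tabular Q-learning on a toy one-dimensional corridor. It has nothing to do with Eureka’s actual methods (which involve far richer simulation and reward design); it just shows the learn-by-repeated-attempts loop in miniature.

```python
import random
from collections import defaultdict

# Toy environment: a robot must shuffle from position 0 to position 4 along a corridor.
# Bumping into the wall costs a little; reaching the goal earns a big reward.
GOAL = 4
ACTIONS = [-1, +1]

def step(position, action):
    """One tick of the simulated world: returns (new_position, reward, episode_finished)."""
    new_pos = position + action
    if new_pos < 0:
        return 0, -1, False            # bumped the wall: small penalty, back to the start
    if new_pos == GOAL:
        return new_pos, 10, True       # reached the goal
    return new_pos, -1, False          # every wasted step costs a little

q = defaultdict(float)                 # Q[(state, action)] -> estimated long-term reward
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount factor, exploration rate

for episode in range(500):             # hundreds of cheap virtual attempts
    pos, done = 0, False
    while not done:
        # Explore occasionally, otherwise take the best-known action (epsilon-greedy).
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(pos, a)])
        new_pos, reward, done = step(pos, action)
        best_next = max(q[(new_pos, a)] for a in ACTIONS)
        q[(pos, action)] += alpha * (reward + gamma * best_next - q[(pos, action)])
        pos = new_pos

# After training, the learned policy should step steadily towards the goal.
print([max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)])  # expect [1, 1, 1, 1]
```

The robot gets to fumble its way to the goal hundreds of times in simulation; only the learned policy, not the fumbling, would ever touch a real machine.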
(The Nvidia Voyager and Eureka projects, which I've followed mostly via Jim Fan and what he posts on LinkedIn, are fascinating research projects in this space: https://blogs.nvidia.com/blog/eureka-robotics-research/ and https://blogs.nvidia.com/blog/ai-jim-fan/ - which see AI used to train robots to spin pens, and autonomous agents learn how to play Minecraft).
This seems a hugely exciting and promising route to accelerate the capability of robots while managing risks. It also opens up cybernetics and symbiotic exoskeletons: an AI running a digital twin of the environment can work much more seamlessly with a human operator. Things that were, until recently, sci-fi are now looking like solvable challenges.
Intelligent Models & Assistants for Various Domains:
The future won’t be led by one giant LLM that everyone uses. The answer isn’t just that we’ll all use either Gemini or GPT. It’s not like picking Firefox vs Chrome. Instead, there’s a boom coming in Large and Small Language Models tailored to specific tasks.
A Small Language Model can be more focused - rather than derailing or giving rogue, open-ended answers, a narrow model can provide specific advice on things like a company’s HR policies, or how your car works, or act as your financial advisor. You don’t need a multi-purpose model that can write you a haiku about derivatives and ETF funds; you need an accurate, up-to-date financial assistant that doesn’t hallucinate.
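As a sketch of what ‘narrow and grounded’ can look like in practice, here’s a toy HR-policy assistant that retrieves the relevant policy text and constrains the model to it. The policy snippets, the keyword retrieval and the build_prompt helper are all invented for illustration; a production system would use proper search or embeddings, with whichever small, domain-tuned model you’d chosen sitting on the end of it.

```python
# Toy 'HR policy assistant': instead of asking a general-purpose model an open-ended
# question, we look up the relevant policy and constrain the model to answer from it.
POLICY_DOCS = {
    "holiday": "Employees receive 25 days of annual leave, plus public holidays.",
    "expenses": "Claims must be submitted within 30 days with receipts attached.",
    "parental": "Primary carers are entitled to 26 weeks of paid parental leave.",
}

def retrieve(question: str) -> str:
    """Very crude keyword retrieval; a real system would use embeddings or a search index."""
    scores = {topic: int(topic in question.lower()) for topic in POLICY_DOCS}
    best_topic = max(scores, key=scores.get)
    return POLICY_DOCS[best_topic]

def build_prompt(question: str) -> str:
    """Constrain the model: answer only from the retrieved policy, or admit it doesn't know."""
    context = retrieve(question)
    return (
        "You are the company HR assistant. Answer ONLY using the policy below. "
        "If the policy does not cover the question, say so.\n\n"
        f"Policy: {context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("How many holiday days do I get?"))
# The assembled prompt would then be sent to whichever small, domain-tuned model you deploy.
```

The design point is that the model is only ever asked to answer from the company’s own text, which is what keeps a narrow assistant accurate, up to date and far less prone to hallucination.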
This is also where the concept of Sovereign AI is interesting - rather than a global model that has to understand and switch between considering every cultural nuance in the world, a FranceGPT vs a MexicoGPT can be narrowed to specific context, cultural nuance and language and as a result be more accurate, intuitive and efficient.
It’s easy to imagine a world where every company has its own model, fine-tuned on their domain, that can work on lots of different tasks and solutions. Or, in fact, maybe many connected models networked together.
Virtual Character Generation: Nvidia have built a set of models and services that compose ACE (Avatar Cloud Engine), a toolkit to create accurate, live, realistic (or stylised) avatars with language, reasoning, backstory and personality.
( https://www.youtube.com/watch?v=psrXGPh80UM - this video demo shows a bit about how it works.)
In the demo we tried, you can have open-ended conversations with computer-controlled NPCs - I ordered a Spider Ramen from a cyberpunk ramen shop, and the chef replied “Great choice - a deliciously crunchy spider ramen coming right up”. It links together virtual avatars, backstories, characters and narrative, in realtime.
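Here’s a hugely simplified sketch of the ingredients behind that kind of character: a fixed persona and a few world rules are assembled into a system prompt, and the player’s line becomes the user turn. None of this is ACE’s actual API; the persona, build_npc_messages and the commented-out generate_reply call are placeholders for whichever avatar or language-model stack actually produces the reply.

```python
# A rough sketch of an AI-driven NPC: a persona plus some world rules become the
# system prompt, and the player's words become the user turn of a chat-style request.
NPC_PERSONA = {
    "name": "Jin",
    "role": "ramen chef in a cyberpunk noodle bar",
    "backstory": "Former corporate engineer who quit to cook; dry sense of humour.",
    "rules": [
        "Stay in character at all times.",
        "Only discuss items on the menu: Spider Ramen, Neon Broth, Synth Gyoza.",
        "Keep replies to one or two short sentences.",
    ],
}

def build_npc_messages(player_line: str, persona: dict) -> list:
    """Assemble the chat-style message list most conversational models expect."""
    system_prompt = (
        f"You are {persona['name']}, a {persona['role']}. "
        f"Backstory: {persona['backstory']} "
        "Rules: " + " ".join(persona["rules"])
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": player_line},
    ]

messages = build_npc_messages("One Spider Ramen please!", NPC_PERSONA)
# reply = generate_reply(messages)   # placeholder for the model call (ACE, an LLM API, etc.)
print(messages[0]["content"])
```

Give the same player line to ten different characters and you get ten different, in-character replies, which is what makes open-ended NPC dialogue possible.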
The potential from this for open-ended worlds and storytelling is enormous. (My nerdy brain jumps to sprawling characters enriching Dungeons & Dragons games, where players are taken wherever they choose, guided by an AI-powered Dungeon Master.) The startup Convai is built around this capability: https://www.convai.com/ and a future of open-ended, immersive storytelling and gaming feels pretty exciting.
This opens up huge opportunities for gaming (most obviously) but also therapy, education and training, customer service, and interactive experiences.
It’s not possible for humans to pre-write infinite stories, but with these tools, LLMs can.
Autonomous, Anticipatory and Adaptive Spatial Tech:
I also believe that over the next 10 years we’re going to shift away from our 3 screen paradigm and the reliance on the big / medium / small screens in our lives.
Part of this is a need for technology that understands and adapts to us, without having to stare at your camera, unlock your phone, swipe into an app and tell it what to do.
Spatial tech is currently focused on the virtual or mixed experiences we will have, via new screens and devices. But I think Anticipatory Tech is the UX shift that will really change things.
If our devices and software understand language, voice, tone and gesture, we will have a whole new way of engaging with the machines around us. This, to me, is exciting. Not to say we’re saying goodbye to our smartphones, but that we might find we’re pretty close to ‘peak screen’ and we might evolve beyond them.
There are doubtless countless more areas and opportunities. This is just a snapshot of the things that might be done with an 'AI foundry' like Nvidia.
Figuring it out and doing it is the next challenge.
Will it pan out exactly this way? Probably not. And there’s definitely a chance that there’s some short-term overhype. Things might slow, dip or burst in the next couple of years. But I think the analogy of the 2000s dotcom bubble is instructive. Not because the bubble burst, but because it was right in the long run.
“We tend to over-estimate technology in the short term, and under-estimate it in the long run” - Roy Amara, who led the Institute for the Future in the 1970s.
It's incredibly exciting that all of this is possible (broadly) with tech we have today. Making it all actually happen comes down to a couple of key factors (the major one being time, as it’s moving at a breakneck pace and we’re all chasing to keep up).
Some of the barriers are technical - more compute, more accuracy, more fine-tuning. Some of the chaotic, complex systems may prove impossible to fully model (the weather is an elusive beast for example!)
But the bigger challenges are going to be:
Society: this is a big shift. And folks generally don’t like change. It will also disrupt a lot of how the world works - potentially for the better, but not universally for the better. I choose to be cheerful rather than fearful, but that’s not to be naive about the legal and responsibility questions, let alone the societal ones.
Imagination: the sky's the limit - but it also takes time to step outside the everyday to look at the sky and dream. Historically, sci-fi and fiction have given us a good blueprint to go and build. We need to keep doing that, and keep thinking big and ambitious.
Regulation: legal liability, indemnification, insurance and risk are all huge topics. We might be able to build an incredible medical surgery robot, but we also need to get it legally approved for use.
Data and privacy: good quality data will make a lot of these digital worlds possible. And, thankfully, AI brings some tools to help with categorisation, tagging and data cleaning. But it’s not a miracle cure. If we don’t measure it, we can’t model it. And, conversely, there are huge privacy challenges around some of the data capture and usage that could unlock great digital twin utility. Balancing risk and privacy against potential benefits is no easy task.
Legacy Systems: the sunk cost fallacy and status quo bias will also come into play. All of this takes effort. Systems change slower than we like to pretend. We’re still working through the past decade of digital transformation. And countries, businesses and citizens don’t have the resources and effort lying around to rip everything up and start again. Multi-trillion dollar investments in infrastructure don't come easily. (And there are other things competing for that investment.)
Ultimately, the new building blocks we can use with AI have only existed for an incredibly short time. We are at the tip of the iceberg of whatever is going to happen. Just as it was impossible, when using Altavista or Netscape for the first time, to predict where we’d be 25 years later, it feels to me we’re at a similar starting point now.
Or I could be wrong, maybe it’s all a fad and it will go away. In which case I’ll happily eat my (Nvidia branded) hat.
Either way, hopefully this has been an illuminating tour through what I found exciting about Nvidia. Their keynote and annual conference is on 18th March, and it will doubtless have some surprises and revelations. https://www.nvidia.com/gtc/keynote/
Thanks for hosting us, and for the hospitality, Cynthia Countouris and Richard Kurtzer.