Tiny but Crucial Details of GPT-4o and Why Leaders Need Deep Technical Understanding
As a teacher in executive AI subjects, a common question I encounter is whether a non-technical person can become an AI or tech lead. My answer is nuanced, and the GPT-4o presentation helps illustrate it.
GPT-4o is a huge improvement over GPT-4, mainly due to: a) its real-time voice and image/video conversation mode, b) its multilingual capabilities, and c) its low price. If you didn't see the presentation, I recommend watching a few demos. Real-time translation or conversational speech would be a good starting point.
How does a voice chatbot work?
In its mobile version, the previous GPT-4 already had a voice mode. It chains a speech-to-text (STT) system, GPT-4 itself (a text-only model), and a text-to-speech (TTS) system to build this voice-based experience. But what we saw in GPT-4o's demo is a multimodal AI that simultaneously combines audio, image, and text inputs. The "o" of GPT-4o stands for omni.
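To make the contrast concrete, here is a minimal sketch of that classic STT → text model → TTS pipeline. Every function here is a hypothetical stand-in (not OpenAI's actual API); the point is the shape of the pipeline, with its three sequential hops.

```python
# Illustrative sketch of the pre-4o voice-chatbot pipeline:
# STT -> text-only model -> TTS. All bodies are hypothetical stand-ins.

def speech_to_text(audio: bytes) -> str:
    """Stand-in for an STT service. Tone, laughter, and emotion in
    the user's voice are lost here: only text survives."""
    return "What's the weather like?"  # pretend transcription

def text_model(prompt: str) -> str:
    """Stand-in for a text-only LLM call."""
    return f"You asked: '{prompt}'. Here is an answer."

def text_to_speech(text: str) -> bytes:
    """Stand-in for a TTS service that renders audio."""
    return text.encode("utf-8")  # pretend waveform

def voice_turn(audio_in: bytes) -> bytes:
    # Three sequential hops; each one adds latency to the turn.
    transcript = speech_to_text(audio_in)
    reply_text = text_model(transcript)
    return text_to_speech(reply_text)

audio_out = voice_turn(b"...user audio...")
print(audio_out.decode("utf-8"))
```

An omni model collapses these three hops into one model call that consumes and produces audio tokens directly, which is why it can preserve tone and react faster.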
The text-only GPT-4o model was released the same day as the presentation, but the full multimodal version was not, which confused a lot of users. Nothing to blame here: the two are really similar. That is the point of this article: the nuances you can only sense with technical knowledge.
Both are so similar that Sam Altman had to clarify on Twitter this week that the full multimodal version has not been released yet.
Tokens are the building blocks of language models. A tokenizer is trained to break down text into smaller units called tokens. Let's see an example.
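Here is a toy greedy tokenizer to illustrate the idea. The tiny hand-written vocabulary is an assumption for illustration only; real tokenizers (such as the BPE tokenizers OpenAI uses) learn their vocabulary from data.

```python
# Toy tokenizer: greedy longest-match against a tiny fixed vocabulary.
# Real tokenizers learn their vocabulary; this one is hand-written
# purely to show how text breaks into sub-word tokens.

VOCAB = ["token", "izer", "s", " are", " the", " building", " blocks",
         " of", " language", " model"]

def tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Longest vocabulary entry matching at position i,
        # falling back to a single character.
        match = max((v for v in VOCAB if text.startswith(v, i)),
                    key=len, default=text[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("tokenizers are the building blocks of language models"))
# -> ['token', 'izer', 's', ' are', ' the', ' building', ' blocks',
#     ' of', ' language', ' model', 's']
```

Note how one word ("tokenizers") can split into several tokens; this detail will matter a lot in the multilingual section below.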
This concept can be extended to voice, image, or video. For example, a sound is composed of many audio tokens: short fragments of audio.
GPT-4o stands out for its ability to work with multiple types of input and output tokens, such as text, images, and sound. This is what we call multimodality.
Because GPT-4o generates audio tokens directly, rather than text that is sent to a TTS system, it can emulate laughter in a natural and very convincing way, and even sing. But it can also produce strange voices, especially in non-English languages. GPT-4o is a totally different product, although the UX is similar to the previous GPT-4 voice system.
It seems like it "thinks in real time". Because it is so fast and works directly with sound, the latency between the user's request and the answer is minimal. This, on top of interruption detection, makes GPT-4o feel quite magical.
But note that older alternatives like GPT-4-turbo were fast enough to give a real-time experience before GPT-4o was released. See vapi.ai as an example of real-time conversation using a more traditional STT → Gen AI model → TTS approach.
Multimodality comes with some challenges: training data is scarce and there are more cybersecurity risks. See how GPT-4o was jailbroken into breaching OpenAI's safety rules with only one image (published one day after the release).
Another important point will be the price. Model use is priced by tokens. Because images, voice, and text are all tokens, we need to know how many tokens an image or a voice recording consumes.
So even though the announcement highlights a price reduction, we don't yet know how expensive GPT-4o will be for voice chatbots compared to a traditional approach. I would not bet much on either option: GPT-4o is cheap, but sound consumes more tokens, and STT/TTS services are getting better and cheaper by the month.
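A back-of-the-envelope sketch shows why the comparison is not obvious. Every price and token count below is a made-up placeholder, not a real quote; check the providers' current price lists before drawing any conclusion.

```python
# Hypothetical cost comparison for one minute of voice conversation.
# ALL numbers below are made-up placeholders for illustration.

# Traditional pipeline: STT + text tokens + TTS, priced separately.
stt_per_min = 0.006        # placeholder $/minute of audio in
tts_per_1k_chars = 0.015   # placeholder $/1k characters out
text_in_per_1k = 0.01      # placeholder $/1k input tokens
text_out_per_1k = 0.03     # placeholder $/1k output tokens

traditional = (stt_per_min * 1
               + 0.2 * text_in_per_1k    # ~200 input tokens
               + 0.15 * text_out_per_1k  # ~150 output tokens
               + 0.8 * tts_per_1k_chars) # ~800 characters spoken

# Omni model: audio is tokens too, and audio may need many more
# tokens than the equivalent text, even at a lower per-token price.
audio_tokens_per_min = 10_000  # made-up audio-token density
omni_per_1k = 0.005            # made-up audio-token price
omni = (audio_tokens_per_min / 1000) * omni_per_1k * 2  # in + out

print(f"traditional ~ ${traditional:.4f}/min, omni ~ ${omni:.4f}/min")
```

Depending on the real token density of audio, the "cheap" model can end up more expensive per conversation minute than the traditional pipeline, which is exactly why I would not bet on either option yet.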
And what about technical leadership?
This is what happens when you have enough technical understanding:
We cannot compare ourselves with Andrej Karpathy, but this is a great example of what technical knowledge brings to the table and how it helps our decision-making. Let me go through an idealized thought process under two different lenses:
Standard takeaway
Deep-knowledge takeaway
As you can see, the technical point of view can add a lot of value to your decision-making.
The point is not: "tech folks are better," but that you need to really understand both sides of the game in a deep way to get the right insights. This idea clashes with some common attitudes among decision-makers. More on that later.
Deep knowledge makes you more prudent, generates more nuance, and leads to more complex insights. You are seeing the Dunning-Kruger effect in action.
It's no surprise that seasoned tech leads tend to prefer old and proven technologies and don't blindly follow the new flashy thing.
It's also no surprise that "quick-and-simple" analyses or "bite-size" content targeted at executive audiences tend to be extremely poor and trigger bad decision-making. You cannot fit nuances into a two-minute reading report. Another topic for another day.
Multilingual Capabilities
The GPT-4o demo was really strong in multilingual capabilities. OpenAI used far more training data in non-English languages, which improves the model's performance. But something really important was easy to miss: the tokenizer.
Tokenizers penalize less frequent character sequences, and therefore non-English languages. See an example tokenizing "cien cañones por banda" in Spanish and English with the old tokenizer:
Note that "cannons" is one token in English, but its Spanish equivalent "cañones" is three ("ca / ñ / ones"). This has huge implications. It is so important that Greg Brockman (co-founder and president of OpenAI) highlighted it.
Fewer tokens mean cheaper inference costs: in some languages, the new tokenizer produces up to 4x fewer tokens. But the improved tokenizer also improves the performance of GPT.
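The cost side of that claim is simple arithmetic. A quick illustration, where the per-token price and the token counts are placeholders I made up, not real figures:

```python
# If the new tokenizer needs 4x fewer tokens for the same prompt,
# the same request costs roughly 4x less. Placeholder numbers only.

price_per_1k_tokens = 0.01    # hypothetical price
old_tokens = 4000             # prompt size under the old tokenizer
new_tokens = old_tokens // 4  # "up to 4x fewer" in some languages

old_cost = old_tokens / 1000 * price_per_1k_tokens
new_cost = new_tokens / 1000 * price_per_1k_tokens
print(old_cost, new_cost)  # 0.04 0.01
```

The same reduction also means more text fits in a fixed context window, which is a second, quieter benefit for non-English users.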
GPT (and any other Transformer-based AI) has an attention mechanism. It combines tokens to give them meaning according to context, as excellently explained in this Financial Times piece: "interest rate" differs from "I have no interest in politics", although "interest" is the same token in both phrases.
In languages other than English, where words are split into multiple tokens, the attention mechanism needs extra effort to "join" those tokens.
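To make that "extra effort" intuition concrete, here is a minimal self-attention sketch in pure Python. The tiny two-dimensional embeddings are made up, and real Transformers use learned query/key/value projections; this stripped-down version only shows how attention mixes token vectors by context.

```python
# Minimal self-attention sketch: each token's output is a weighted
# mix of all token embeddings, weights given by dot-product scores.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(embeddings):
    """Simplified attention: scores are raw dot products of the
    embeddings themselves (no learned Q/K/V projections)."""
    out = []
    d = len(embeddings[0])
    for q in embeddings:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, embeddings))
                    for j in range(d)])
    return out

# Imagine "ca", "ñ", "ones" as three separate tokens: attention must
# first spend capacity gluing them back into one word ("cañones")
# before it can model the word's meaning in context.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3]]
print(attention(tokens))
```

A single-token word skips that reassembly step entirely, which is one intuition for why a multilingual-friendly tokenizer can improve quality, not just cost.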
Therefore, by changing the tokenizer to be multilingual-friendly, we have just achieved two things: cheaper inference in non-English languages (fewer tokens to pay for) and better model performance in those languages (less attention effort wasted on reassembling split words).
There are a few interesting technical takeaways about this, but I will keep them for the "A Final Exercise" section.
So if deep tech knowledge is so useful, why is it not so widespread among decision-makers? Many reasons, but I will highlight one: a cultural problem.
The two-cultures problem in technical leadership
There is a persistent attitude of contempt towards technical knowledge from "business" or "strategy" roles, and vice versa. This is really harmful to organizations. In my experience, this attitude can take three forms:
1. Strategists vs. worker bias: "I am a leader, I don't know how to program/develop/do this stuff because I make strategy/think big."
Defining leadership or strategy as the opposite of implementation is a bad idea. Better-informed leaders make grounded decisions, and their stakeholders and teams benefit from that. HBR's "If Your Boss Could Do Your Job, You're More Likely to Be Happy at Work" is a good resource on this.
2. Job description strictness: "This is not what I have to do."
That is (partially) true. But knowing how the work is done and understanding its fundamentals is part of a leader's job: it shapes how you read the market and its behavior, and it inevitably overlaps with a "pure" technical job description.
3. Opportunity cost blindness: "I have no time to learn all those things."
Which really means: technical knowledge is not perceived as valuable enough to prioritize. I hope this article helps you see how much you can gain from getting into technical stuff.
Navigating hyped waters
So, how technical should technical leadership roles be? There is no perfect answer, but we can focus on what happens if they are not technical enough.
Do you remember the 2014-15 Big Data hype? I do. I remember terrible bets and purchases of expensive solutions that were thrown away within one or two years, when the main stakeholder left the company or the growing cost no longer justified further investment.
We should learn from those crazy years because it is happening again.
Deep knowledge is the most important skill for technical decision-makers in an overhyped market. It will save you a ton of money and a few bad surprises.
Gen AI is different from other technological shifts: more complex, faster, and probably more impactful.
There will be no manuals or use cases to copy and paste in the short term, and those that appear will be outdated really fast. Worse yet, many of them will be trying to sell you something.
In early-stage phases of innovation like this, exploration, MVPs, iterative testing, internal know-how (yours!), and expectation management are key elements of success where traditional consulting struggles to deliver. This is a good topic for another article.
Learn how to develop with Gen AI, especially if you are not a developer.
You should learn things that feel useless at first; they are worth it when you dodge the bullet of a bad use case, save 10x the price of an over-engineered solution, or reduce your team's turnover.
Gen AI has a nice characteristic: it is quite no-code friendly. You can understand a lot of its fundamentals without getting lost in other details.
Prudence is a side effect of knowledge. You will start to discard some perfectly presented and marketing-shiny ideas. Congratulations, now you have criteria.
Some peers won't understand that. Use it in your favor: explain your reasoning and show your deep know-how. It is a clear differentiator, but only if you can explain it. And as a teacher, trust me: you need to understand something in order to explain it.
A final exercise
Let's sum up the main insights we got from the GPT-4o presentation through my "technical lenses" (though I may be mistaken). Try to determine how many of these insights you would have discovered on your own, and which ones required more technical nuance to grasp.
Don't worry. It is not important how many of them you would have discovered. The goal of this article is to help you understand in which way technical and business knowledge should be combined to make a good leader and strategist.
It is not possible to know how much you don't know. So learn. Even apparently out-of-your-scope things. Especially those. Mix tech and "business" know-how. It will not hurt and will probably help a lot.
Was it interesting? Did you learn something new?
This is the first article in the "Tiny but Crucial" series. For the next one, I could use your help: I will leave a short survey (two questions) in the comments to analyze another strong bias among AI-invested decision-makers.
I have left a few interesting topics for future articles. Let me know what you would love to read, and follow me if you want to see more of this.